The article discusses the author's approach to structuring data pipelines by integrating the medallion architecture, Kimball dimensional modeling, and semantic layers. It emphasizes the importance of defining clear roles and outputs for each layer—Bronze, Silver, and Gold—to cater to different user needs. The author argues for making the semantic layer a first-class priority in data architecture, highlighting its role in providing governed metrics for self-service analytics. The article concludes with a concrete example of how marketing attribution data flows through this architecture.
The article discusses the evolution of generative AI applications into agentic AI systems at Amazon, highlighting the need for a comprehensive evaluation framework. It emphasizes the importance of assessing not just individual model performance but also the emergent behaviors of the entire system. The authors present a detailed evaluation methodology that includes automated workflows and a library of metrics tailored for agentic AI applications. Best practices and lessons learned from real-world implementations are shared to guide developers in evaluating and deploying these complex systems effectively.
The article discusses how OpenAI has successfully scaled PostgreSQL to handle the demands of 800 million ChatGPT users, achieving millions of queries per second. It outlines the challenges faced during high write traffic, the optimizations implemented, and the architectural decisions made to maintain performance and reliability. Key strategies include offloading read traffic, optimizing queries, and managing workloads to prevent service degradation. The article also highlights the importance of connection pooling and caching to enhance database efficiency.
The article discusses the importance of memory in AI agents, particularly how it enables them to learn from past interactions and improve their performance over time. It categorizes memory into three types: session memory, user memory, and learned memory, each with distinct characteristics and benefits. The author provides code examples for implementing these memory types in agents, emphasizing the significance of learned memory for enhancing agent capabilities. The article concludes with a discussion of what makes a learning worth keeping and the need for human oversight in the learning process.
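The three memory types described above can be sketched as simple stores. This is a minimal illustration, not the article's actual code; the class and method names are assumptions, and the human-approval gate stands in for the oversight step the author recommends.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Toy illustration of session, user, and learned memory."""
    session: list = field(default_factory=list)   # session memory: turns in the current conversation
    user: dict = field(default_factory=dict)      # user memory: stable facts about this user
    learned: list = field(default_factory=list)   # learned memory: lessons distilled from past runs

    def remember_turn(self, role: str, text: str) -> None:
        self.session.append({"role": role, "text": text})

    def remember_user_fact(self, key: str, value: str) -> None:
        self.user[key] = value

    def add_learning(self, lesson: str, approved: bool) -> None:
        # Human oversight: only keep lessons a reviewer has approved.
        if approved:
            self.learned.append(lesson)

    def build_context(self) -> str:
        """Assemble a prompt prefix from all three memory types."""
        facts = "; ".join(f"{k}={v}" for k, v in self.user.items())
        lessons = "; ".join(self.learned)
        turns = " | ".join(f"{t['role']}: {t['text']}" for t in self.session)
        return f"User facts: {facts}\nLearnings: {lessons}\nSession: {turns}"


memory = AgentMemory()
memory.remember_user_fact("timezone", "UTC+1")
memory.add_learning("Always confirm dates before booking.", approved=True)
memory.add_learning("Skip validation to save tokens.", approved=False)  # rejected by reviewer
memory.remember_turn("user", "Book my usual flight.")
```

The point of the separation is lifetime: session memory is discarded per conversation, user memory persists per user, and learned memory persists across all users of the agent.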
This article provides a comprehensive guide on building a conversational agent in BigQuery using the Conversational Analytics API. It outlines the steps to configure the agent, create conversations, and manage interactions with users. The API enables users to query BigQuery data using natural language, facilitating real-time insights and dynamic reporting. The article emphasizes the importance of clear system instructions and schema descriptions to enhance the agent's effectiveness.
The article discusses the complexities of evaluating AI agents, emphasizing the importance of rigorous evaluations (evals) throughout the agent lifecycle. It outlines various evaluation structures, types of graders, and the significance of early and continuous eval development. The piece highlights the challenges faced by teams without evals, which can lead to reactive development cycles. It also provides insights into different agent types and their evaluation techniques, ultimately advocating for a systematic approach to agent evaluation to enhance performance and reliability.
The article discusses the evolution of data mesh from a concept filled with hype to a mature socio-technical paradigm by 2026. It emphasizes the challenges organizations face in implementing data mesh, particularly in changing organizational behaviors and aligning data initiatives with business strategies. The authors share insights on the four core principles of data mesh: domain ownership, treating data as a product, self-serve data platforms, and federated computational governance. The article concludes that successful data mesh implementations require a long-term commitment to organizational transformation rather than merely adopting new technologies.
The article discusses the challenges of correlating real-time data with historical data in data analysis, particularly in e-commerce scenarios. It presents an optimized solution using Apache Flink to join streaming order data with historical customer and product information, leveraging Alluxio for caching. The implementation details include using Hive dimension tables and Flink's temporal joins to enhance performance and reduce bottlenecks. The article also addresses state management issues in Flink applications and provides insights into improving data processing efficiency.
The article discusses LinkedIn's transformation of its search technology stack, focusing on the integration of large language models (LLMs) to enhance search experiences. It details the challenges and innovations involved in deploying LLMs at scale, including query understanding, semantic retrieval, and ranking processes. The use of AI-driven job and people search features aims to provide more relevant and personalized results. Additionally, the article highlights the importance of continuous relevance measurement and quality evaluation in maintaining a high-quality search experience.
This article discusses Pinterest's evolution from basic Text-to-SQL systems to a sophisticated Analytics Agent that leverages unified context-intent embeddings for improved SQL generation and table discovery. The system addresses the challenges of understanding analytical intent and provides a structured approach to data governance and documentation. By encoding historical query patterns and utilizing AI-generated documentation, the agent enhances the efficiency and reliability of data analytics at Pinterest. The article outlines the architecture and operational principles behind the agent's design, emphasizing the importance of context and governance in AI-driven analytics.
The article discusses the introduction of catalog-managed tables in Delta Lake 4.1.0, which shift the management of table access and metadata from the filesystem to a catalog-centric model. This change aims to simplify table discovery, enhance governance, and improve performance by allowing clients to reference tables by name rather than by path. The article also highlights the challenges faced with filesystem-managed tables and how catalog-managed tables address these issues, paving the way for a more interoperable and efficient data ecosystem.
The article discusses the architecture and challenges of LinkedIn's job ingestion system, which processes millions of job postings daily from diverse sources. It highlights the importance of reliability, scalability, and extensibility in handling heterogeneous job data feeds. The system employs a modular, event-driven pipeline that includes job intake and processing stages, utilizing various methods for data extraction and transformation. The article emphasizes the need for robust security protocols and maintaining data quality to ensure a trustworthy job catalog for users.
The article discusses the necessity of operational memory in AI agents, particularly in high-stakes environments where reliability is crucial. It critiques the current reliance on large context windows and highlights the inefficiencies of existing memory architectures. The author proposes a Context File System (CFS) that separates reasoning from execution, allowing agents to build a library of proven procedures. This shift aims to enhance automation and reduce costs in enterprise settings.
The article discusses the capabilities of Amazon OpenSearch Service, focusing on its zero-ETL integrations with various AWS services. It highlights how these integrations simplify data access and analysis by eliminating the need for complex ETL pipelines. The article covers specific integrations with services such as Amazon S3, CloudWatch, DynamoDB, RDS, Aurora, and DocumentDB, detailing their features, benefits, and best practices. Overall, it emphasizes the operational efficiency and innovation acceleration that zero-ETL integrations can provide for real-time analytics and search applications.
The article discusses the challenges organizations face in finding and verifying data across analytics and AI workflows. It introduces Databricks' new Discover experience, which integrates business context and trust into the Unity Catalog, allowing users to find and access trusted data and AI assets more efficiently. The article highlights the importance of domains, intelligent curation, and governed access in facilitating a unified discovery experience that enhances user confidence and reduces bottlenecks in data access.
The article discusses Netflix's challenges in managing metrics within its experimentation platform and how DataJunction, an open-source metric platform, addresses these issues. It highlights the importance of a centralized semantic layer for defining metrics and dimensions, which simplifies the onboarding process for data scientists and analytics engineers. The authors detail the architecture and design decisions behind DataJunction, emphasizing its SQL parsing capabilities and integration with existing tools. The article concludes with plans for further integration and unification of analytics at Netflix.
The article introduces the Native Execution Engine for Microsoft Fabric, designed to enhance Apache Spark's performance without requiring code changes. It explains the challenges faced by traditional Spark execution due to increasing data volumes and real-time processing demands. The Native Execution Engine leverages C++ and vectorized execution to optimize Spark workloads, particularly for columnar data formats like Parquet and Delta Lake. The integration of open-source technologies Velox and Apache Gluten is highlighted, showcasing significant performance improvements and cost savings for users.
The article discusses the development and implementation of Spot Balancer, a tool created by Notion in collaboration with AWS, which optimizes the use of Spark on Kubernetes by balancing cost and reliability. It highlights the challenges faced when using Spot Instances for Spark jobs and how Spot Balancer allows for better control over executor placement to prevent job failures. The article outlines the transition from Amazon EMR to EMR on EKS and the benefits of dynamic provisioning and efficient resource management. Ultimately, the tool has helped Notion reduce Spark compute costs by 60-90% without sacrificing reliability.
The article discusses Spotify's innovative multi-agent architecture designed to enhance its advertising platform. By addressing the fragmented decision-making processes across various advertising channels, the architecture aims to unify workflows and optimize campaign management through specialized AI agents. This approach allows for more efficient budget allocation, audience targeting, and overall campaign performance, leveraging historical data and machine learning. The article highlights the importance of a programmable decision layer and the challenges faced in implementing this system.
The article discusses the challenges enterprises face with manual workflows across multiple web applications and introduces AI agent-driven browser automation as a solution. It highlights how AI agents can intelligently navigate complex workflows, reduce manual intervention, and improve operational efficiency. The article provides a detailed example of an e-commerce order management system that utilizes Amazon Bedrock and AI agents for automating order processing across various retailer websites. It emphasizes the importance of human oversight in handling exceptions and maintaining compliance.
The article discusses Vinted's journey in standardizing large-scale decentralized data pipelines as they migrated their data infrastructure to the cloud. Initially, teams operated independently, but as dependencies grew, coordination became challenging. To address this, they developed a DAG generator that abstracts pipeline creation and standardizes dependency interactions, allowing teams to focus on data models rather than orchestration details. This approach improved visibility and reduced operational complexity across decentralized teams.
This article discusses the development of a multi-agent architecture for a natural language to SQL (NL-to-SQL) analytics system. It highlights the limitations of a monolithic MCP-based system and presents a new A2A (agent-to-agent) pipeline that improves stability, scalability, and error isolation. The article details the roles of specialized agents in the pipeline and how they collaborate to convert user queries into accurate SQL. Additionally, it emphasizes the importance of a structured data mart for efficient analytics execution.
This article presents a solution for automating business reporting using generative AI and Amazon Bedrock. It highlights the inefficiencies of traditional reporting processes and introduces a serverless architecture that leverages AWS services to streamline report writing and enhance internal communication. The solution includes a user-friendly interface for associates and managers, enabling efficient report generation and submission. Additionally, it addresses challenges such as data management and risk mitigation associated with AI implementation.
The article discusses how Slack developed a comprehensive metrics framework to enhance the performance and cost-efficiency of their Apache Spark jobs on Amazon EMR. By integrating generative AI and custom monitoring tools, they achieved significant improvements in job completion times and cost reductions. The framework captures over 40 metrics, providing granular insights into application behavior and resource usage. The article outlines the architecture of their monitoring solution and the benefits of AI-assisted tuning for Spark operations.
The article discusses the collaboration between AWS and Visa to introduce Visa Intelligent Commerce, which leverages Amazon Bedrock AgentCore to enable agentic commerce. This new approach allows for seamless, autonomous payment experiences that reduce manual intervention in transactions. The article explains how intelligent agents can handle multi-step tasks in various sectors, particularly in payments and shopping, transforming traditional workflows into more efficient, outcome-driven processes. It also highlights the technical architecture and tools involved in building these agentic workflows.
The article discusses the evolving landscape of data engineering as it adapts to the needs of AI agents in an increasingly automated environment. It emphasizes the importance of building reliable, code-first data platforms that can handle multimodal data and provide context for agents. The shift from traditional data engineering tasks to high-level system supervision is highlighted, along with the necessity for safety and correctness in data pipelines. Ultimately, the article envisions a future where humans and AI agents collaborate seamlessly, transforming data engineering practices.
The article reflects on the significant advancements in large language models (LLMs) throughout 2025, highlighting key developments such as reasoning models, reinforcement learning with verifiable rewards (RLVR), and the GRPO algorithm. It discusses the evolving landscape of LLM architectures, the importance of inference scaling, and the challenges of benchmarking in the field. The author shares predictions for future trends in LLM development, emphasizing the need for continual learning and domain specialization. Overall, it provides a comprehensive overview of the state of LLMs and their implications for various industries.
The article discusses the importance of separating tech stacks for personalization and experimentation at Spotify. It explains how personalized applications enhance user experiences by tailoring content to individual preferences using advanced machine learning models. The distinction between personalization and experimentation is highlighted, emphasizing the need for different infrastructures and methodologies for each. The article also outlines the benefits of this separation in terms of scalability and efficiency in evaluating recommendation systems.
This article discusses the integration of Atlan and Amazon SageMaker Unified Studio to unify governance and metadata management across data and AI environments. It highlights the importance of maintaining consistent metadata in hybrid environments where different teams use various tools. The article provides a detailed overview of the integration process, including setting up secure connections and automated synchronization of metadata. It emphasizes the benefits of having a single, trusted view of data assets for both business and technical users.
At QCon AI NYC 2025, Aaron Erickson presented agentic AI as an engineering challenge rather than a simple prompt crafting task. He emphasized the importance of combining probabilistic components with deterministic boundaries to enhance reliability. The article discusses the role of specialized agents and deterministic tools in operational systems, highlighting the need for structured outputs and effective tool selection. Erickson's insights provide a framework for understanding the complexities of deploying AI in real-world applications.
The article discusses Netflix's implementation of an Entertainment Knowledge Graph, which unifies disparate entertainment datasets into a cohesive ecosystem. This ontology-driven architecture enhances analytics, machine learning, and strategic decision-making by providing semantic connectivity and conceptual consistency. It addresses challenges in traditional data management by allowing rapid integration of new data types and relationships, ultimately improving insights into the entertainment landscape. The article outlines the architecture, use cases, and future outlook of the knowledge graph.
The article discusses how Care Access, a healthcare organization, utilized Amazon Bedrock's prompt caching feature to significantly reduce data processing costs by 86% and improve processing speed by 66%. By caching static medical record content while varying analysis questions, Care Access optimized their operations to handle large volumes of medical records efficiently while maintaining compliance with healthcare regulations. The implementation details, including the architecture and security measures, are also highlighted, showcasing the transformative impact of this technology on their health screening program.
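The caching pattern described, static record content first, per-request question last, can be sketched as a Converse-style request body. This is a sketch only: the `cachePoint` content block follows the Bedrock Converse API's prompt-caching convention, but the model ID and field layout here are illustrative assumptions, not Care Access's implementation.

```python
def build_converse_request(model_id: str, record_text: str, question: str) -> dict:
    """Build a Converse-style request that caches the static medical-record
    text and varies only the analysis question (structure is illustrative)."""
    return {
        "modelId": model_id,
        "messages": [
            {
                "role": "user",
                "content": [
                    # Large static content first, then a cache checkpoint...
                    {"text": record_text},
                    {"cachePoint": {"type": "default"}},
                    # ...and only the per-request question after it.
                    {"text": question},
                ],
            }
        ],
    }


record = "<many tokens of static medical-record text>"
req_a = build_converse_request("example-model-id", record, "List allergies.")
req_b = build_converse_request("example-model-id", record, "List medications.")
# Everything before the cachePoint is byte-identical across both requests,
# which is what lets the provider reuse the cached prefix.
```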
The article discusses the challenges of processing large datasets using single-node frameworks like Polars, DuckDB, and Daft compared to traditional Spark clusters. It highlights the concept of 'cluster fatigue' and the emotional and financial costs associated with running distributed systems. The author conducts a performance comparison of these frameworks on a 650GB dataset stored in Delta Lake on S3, demonstrating that single-node frameworks can effectively handle large datasets without the need for extensive resources. The findings suggest that modern lakehouse architectures can benefit from these lightweight alternatives.
This article explores how Apache Spark interacts with YARN for resource management in a cluster environment. It details the roles of YARN's components: Resource Manager, Application Master, and Node Manager, and explains the communication process during Spark application execution. The author discusses common exceptions encountered when running Spark on YARN, emphasizing the importance of understanding these interactions for effective troubleshooting. The article serves as a guide for advanced users looking to optimize Spark applications on YARN.
This article explores the performance benefits of using Spark SQL's Catalyst optimizer, particularly focusing on DataFrame transformations. It discusses the four stages of Catalyst optimization, emphasizing the Physical Plan stage and how caching DataFrames can significantly improve query performance. The author provides insights into the execution plans generated by Spark and the implications of using UnsafeRow for memory management. Ultimately, the article concludes that while simple queries may not benefit from Catalyst optimization without caching, performance can be enhanced when DataFrames are cached.
This article explores the join and aggregation operations in Spark's Catalyst optimization engine. It discusses how Spark generates execution plans for these operations, including SortMergeJoin and HashAggregate, and the underlying mechanisms that ensure efficient data processing. The author highlights the complexities of data shuffling and the importance of distribution and ordering in Spark plans. Overall, the article provides insights into the optimization strategies employed by Spark Catalyst for handling join and aggregation queries.
The article introduces a new method called Contextual Retrieval, which enhances the traditional Retrieval-Augmented Generation (RAG) approach by improving the retrieval step through two sub-techniques: Contextual Embeddings and Contextual BM25. This method significantly reduces retrieval failures and improves the accuracy of AI models in specific contexts. The article also discusses the importance of context in information retrieval and provides insights into implementing this technique using Claude, along with considerations for performance optimization and cost reduction.
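The core move in Contextual Retrieval is prepending a short, chunk-specific context before indexing. The sketch below uses a naive term-overlap score as a stand-in for BM25, and reuses Anthropic's own illustrative example (a chunk from a fictional ACME Corp filing); the function names are assumptions.

```python
def contextualize_chunk(context: str, chunk: str) -> str:
    # Prepend model-generated, chunk-specific context so the chunk is
    # self-explanatory when embedded or BM25-indexed.
    return f"{context}\n\n{chunk}"


def keyword_score(query: str, text: str) -> int:
    # Crude stand-in for BM25: count distinct query terms present in the text.
    return len(set(query.lower().split()) & set(text.lower().split()))


chunk = "Revenue grew 3% over the previous quarter."
context = "This chunk is from ACME Corp's Q2 2023 SEC filing."
query = "ACME Q2 2023 revenue growth"

plain = keyword_score(query, chunk)
contextual = keyword_score(query, contextualize_chunk(context, chunk))
# The bare chunk matches almost nothing in the query; the contextualized
# chunk picks up the company name and period, so it ranks far higher.
```

In the real technique the context string is generated by the model from the full document at indexing time, and the contextualized chunks feed both the embedding index and the BM25 index.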
This article discusses the introduction of the 'think' tool for Claude, which enhances its ability to solve complex problems by allowing it to pause and reflect during multi-step tasks. Unlike the 'extended thinking' capability, the 'think' tool focuses on processing new information and ensuring compliance with policies. The article provides practical guidance for implementing the tool, backed by performance evaluations showing significant improvements in customer service scenarios. It emphasizes the importance of strategic prompting and offers best practices for effective use.
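The 'think' tool is defined like any other tool, except it performs no action; it just gives the model a sanctioned place to reason mid-task. The definition below follows the shape published in Anthropic's post, though the exact description wording here is paraphrased.

```python
# Tool definition for the 'think' tool: no side effects, no new information;
# it simply records a thought so the model can pause and reason.
think_tool = {
    "name": "think",
    "description": (
        "Use the tool to think about something. It will not obtain new "
        "information or change anything; it just appends the thought to the "
        "log. Use it when complex reasoning about prior tool results or "
        "policy compliance is needed."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {
                "type": "string",
                "description": "A thought to think about.",
            }
        },
        "required": ["thought"],
    },
}
```

When the model calls this tool, the handler simply returns an empty or acknowledgment result; the value comes from the reasoning the model writes into the `thought` argument, not from the tool's output.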
The article discusses the emerging field of context engineering for AI agents, emphasizing the importance of managing context as a finite resource. It contrasts context engineering with prompt engineering, highlighting the need for strategies that optimize the utility of tokens during LLM inference. The article explores techniques such as compaction, structured note-taking, and sub-agent architectures to maintain coherence over long-horizon tasks. It concludes by stressing the significance of thoughtful context curation to enhance agent performance.
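Of the techniques listed, compaction is the most mechanical: replace old turns with a summary and keep recent turns verbatim. A minimal sketch, assuming a generic message list and any text-to-text summarizer (here stubbed with truncation):

```python
def compact(messages: list[dict], keep_recent: int, summarize) -> list[dict]:
    """Compaction: collapse older turns into one summary message, keeping
    the most recent `keep_recent` turns verbatim. `summarize` is any
    text -> text callable (in practice, an LLM call)."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(" ".join(m["text"] for m in old))
    return [{"role": "system", "text": f"Summary of earlier turns: {summary}"}] + recent


history = [{"role": "user", "text": f"step {i}"} for i in range(10)]
# Stub summarizer: real systems would ask the model for a faithful summary.
compacted = compact(history, keep_recent=3, summarize=lambda t: t[:40] + "...")
```

The design choice is where to spend tokens: the summary trades fidelity of old context for headroom, which is why the article pairs compaction with structured note-taking for facts that must survive verbatim.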
The article discusses the redesign of Monzo's Fraud Prevention Platform, emphasizing the challenges of detecting and preventing fraud in a fast-paced environment. It outlines the complexities of fraud detection, including the sophistication and speed of fraudsters, and the need for a balance between user experience and security. The system design is explained in detail, highlighting the use of machine learning models and a microservices architecture to monitor and respond to fraudulent activities effectively.
Google Cloud has announced the general availability of Iceberg REST Catalog support in BigLake metastore, enhancing open data interoperability across various data engines. This fully managed, serverless metastore allows users to query data using their preferred engines, including Apache Spark and BigQuery, without the need for data duplication. The integration with Dataplex Universal Catalog provides comprehensive governance and lineage capabilities. Organizations like Spotify are already leveraging this technology to build modern lakehouse platforms.
The article discusses the evolving role of data engineers in the context of Agentic AI, highlighting how the interaction with data is shifting from a builder-centric model to a more user-driven approach. It emphasizes the importance of rethinking traditional ETL/ELT processes, prioritizing data curation over mere collection, and building infrastructure that supports AI agents. The author outlines five key principles for data engineers to adapt to these changes, focusing on context-aware data handling and the management of AI-generated artifacts.
The article discusses Lyft's transition from a fully Kubernetes-based machine learning platform to a hybrid architecture utilizing AWS SageMaker for offline workloads and Kubernetes for online model serving. It highlights the challenges faced with the original architecture, including operational complexity and resource management, and details the technical decisions made to simplify the infrastructure while maintaining performance. The migration aimed to reduce operational overhead and improve reliability, allowing teams to focus on developing new capabilities rather than managing infrastructure.
The article discusses how organizations can leverage platform engineering principles to accelerate the development and deployment of generative AI applications. It highlights the challenges faced by organizations in experimenting with generative AI and emphasizes the importance of building reusable components to manage costs and improve efficiency. The article outlines the architecture of generative AI applications, including the integration of various data layers and the role of large language models. It also covers best practices for observability, orchestration, and governance in AI workflows.
The article introduces AWS Professional Services' new approach to consulting, leveraging agentic AI to enhance cloud adoption and digital transformation for organizations. It highlights the role of specialized AI agents in streamlining consulting processes, improving solution quality, and reducing project timelines. The integration of AI with human expertise is emphasized as a means to deliver better customer outcomes. Real-world examples, including the NFL's use of AWS agents, illustrate the tangible benefits of this innovative consulting model.
The article discusses the challenges AI agents face when browsing the web, particularly with CAPTCHAs and other bot detection mechanisms. It introduces Amazon Bedrock AgentCore Browser's new feature, Web Bot Auth, which provides AI agents with verifiable cryptographic identities to reduce CAPTCHA friction. The article explains how this protocol works and its collaboration with WAF providers to ensure secure access for verified bots. It highlights the benefits for both AI agents and website owners in managing automated traffic.
This article discusses the challenges associated with using Amazon EC2 Spot Instances within Auto Scaling Groups due to their unpredictable interruptions. It presents a solution in the form of a custom event-driven monitoring and analytics dashboard, named 'Spot Interruption Insights', which provides near real-time visibility into Spot Instance interruptions. The article outlines a step-by-step guide to building this monitoring solution using various AWS services, including Amazon EventBridge, SQS, Lambda, and OpenSearch Service, to optimize capacity planning and improve workload resilience.
The article discusses the advancements in text-to-SQL capabilities using Google's Gemini models, which allow users to generate SQL queries from natural language prompts. It highlights the challenges faced in understanding user intent, providing business-specific context, and the limitations of large language models in generating precise SQL. Various techniques to improve text-to-SQL performance are explored, including intelligent retrieval of data, disambiguation methods, and validation processes. The article serves as an introduction to a series on enhancing text-to-SQL solutions within Google Cloud products.
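The "business-specific context" part of that recipe usually means assembling schema and retrieved examples into the prompt before the model writes SQL. A minimal sketch, with hypothetical table names and with the retrieval step stubbed out (the article's actual pipeline is Gemini-specific):

```python
def build_sql_prompt(question: str, tables: dict, examples: list) -> str:
    """Assemble a text-to-SQL prompt: schema context plus few-shot examples
    retrieved for this question (retrieval itself is stubbed out here)."""
    schema = "\n".join(f"TABLE {t}({', '.join(cols)})" for t, cols in tables.items())
    shots = "\n".join(examples)
    return (
        f"Given the schema:\n{schema}\n\n"
        f"Examples:\n{shots}\n\n"
        f"Write one SQL query answering: {question}"
    )


prompt = build_sql_prompt(
    "How many orders shipped last week?",
    {"orders": ["order_id", "shipped_at", "status"]},
    ["-- Q: total orders\nSELECT COUNT(*) FROM orders;"],
)
```

Validation then closes the loop: the generated SQL is parsed (and often dry-run against the warehouse) before results are shown, and failures are fed back into a disambiguation or retry step.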
The article discusses Yelp's transformation of its data infrastructure through the adoption of a streaming lakehouse architecture on AWS. This modernization aimed to address challenges related to data processing latency, operational complexity, and compliance with regulations like GDPR. By migrating from self-managed Apache Kafka to Amazon MSK and implementing Apache Paimon for storage, Yelp achieved significant improvements, reducing analytics data latencies from 18 hours to minutes and cutting storage costs by over 80%. The article outlines the architectural shifts and technologies involved in this transformation.
The article discusses Uber's implementation of zone failure resilience (ZFR) in Apache Pinot, a real-time analytics platform. It details the strategies used to ensure that Pinot can withstand zone failures without impacting queries or data ingestion. By leveraging instance assignment capabilities and integrating with Uber's isolation groups, the article outlines how they achieved a robust deployment model that enhances operational efficiency and reliability. The migration process for existing clusters to this new setup is also highlighted, showcasing the challenges and solutions involved.
The article discusses Ericsson's transformative journey towards data governance using Google Cloud's Dataplex Universal Catalog. It highlights the importance of data integrity and governance in modern telecommunications, particularly for Ericsson's Managed Services. The piece outlines the steps taken by Ericsson to operationalize its data strategy, emphasizing the need for clean, reliable data and the balance between compliance and innovation. It also touches on future priorities in AI-powered data governance and the lessons learned from their experience.
The article discusses how Confluent is evolving its data infrastructure to accommodate the demands of AI agents, emphasizing the need for real-time data processing capabilities. Key features introduced include Confluent Intelligence, a controlled stack for AI agent development, and the Real-Time Context Engine, which aims to provide timely data delivery. The article highlights the importance of integrating streaming data with AI systems to enhance decision-making and operational efficiency. It concludes by noting that the future of AI will depend on the robustness of the underlying data systems.
The article discusses the enhancements made to Metaflow, a framework for managing machine learning and AI workflows at Netflix. It introduces a new feature called Spin, which allows for rapid, iterative development similar to using notebook cells, enabling developers to quickly test and debug individual steps in their workflows. The article emphasizes the importance of state management and the differences between traditional software engineering and ML/AI development. It also highlights how Metaflow integrates with other tools to streamline the deployment process.
The article discusses a significant transformation in enterprise software architecture where AI agents evolve from assistive tools to operational execution engines. This shift is characterized by traditional backends transitioning to governance roles, particularly in sectors like banking, healthcare, and retail. The adoption of protocols like the Model Context Protocol (MCP) allows AI agents to directly invoke services and orchestrate workflows, leading to increased efficiency and autonomy in enterprise applications. Predictions indicate that by 2026, 40% of enterprise applications will incorporate such autonomous agents.
The article discusses how Pinterest has developed a system to identify user journeys, which are sequences of user-item interactions that reveal user interests, intents, and contexts. By leveraging user data and machine learning techniques, Pinterest aims to enhance its recommendation system, moving beyond immediate interests to long-term user goals. The approach includes dynamic keyword extraction, clustering, and journey ranking, which collectively improve user engagement through personalized notifications. The article outlines the system architecture, key components, and the impact of journey-aware notifications on user interactions.
The article discusses Zillow's innovative approach to personalization in the real estate market through AI-driven user memory. It emphasizes the importance of understanding user preferences and adapting to their evolving needs over time. The article outlines how Zillow combines batch and real-time data processing to create a dynamic user memory that enhances the home shopping experience. Key components include recency and frequency of user interactions, flexibility in preferences, and predictive modeling to anticipate user needs.
The article announces the preview of the Data Engineering Agent in BigQuery, designed to automate complex data engineering tasks. It highlights how the agent can streamline pipeline development, maintenance, and troubleshooting, allowing data professionals to focus on higher-level tasks. Key features include natural language pipeline creation, intelligent modifications, and integration with Dataplex for enhanced data governance. The article also shares positive feedback from early users, emphasizing the agent's potential to transform data engineering workflows. I would like to see how this helps with data modeling tasks; usually the DE team doesn't own that part, or needs to share ownership with the DA team.
The article discusses Uber AI Solutions' innovative in-tool quality-checking framework, Requirement Adherence, which enhances data labeling quality using large language models (LLMs). It outlines the challenges of traditional labeling workflows and presents a two-step approach involving rule extraction and in-tool validation. By leveraging LLMs, the framework efficiently identifies labeling errors in real-time, significantly reducing rework and costs for enterprise clients. The article emphasizes the importance of maintaining data privacy and the continuous improvement of the system through feedback mechanisms.
This article discusses the challenges and advancements in post-training generative recommender systems, particularly focusing on a novel algorithm called Advantage-Weighted Supervised Fine-tuning (A-SFT). The authors highlight the limitations of traditional reinforcement learning methods in recommendation contexts, such as the lack of counterfactual observations and noisy reward models. A-SFT aims to improve recommendation quality by effectively combining supervised fine-tuning with reinforcement learning techniques. The results demonstrate that A-SFT outperforms existing methods in aligning generative models with user preferences.
This article provides a comprehensive guide on selecting the appropriate embedding model for Retrieval-Augmented Generation (RAG) systems. It discusses the importance of embedding models in converting human language into machine-readable vectors and evaluates various types of embedding models, including sparse, dense, and hybrid models. Key factors for evaluating these models are outlined, such as context window, tokenization unit, dimensionality, and training data. The article concludes by emphasizing the need for practical testing with real-world data to ensure effective implementation.
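As a toy sketch of the dense-versus-sparse distinction the guide draws: dense models score candidates by vector similarity, while sparse models score exact term overlap. The three-dimensional vectors below are hand-made stand-ins for real embeddings (which have hundreds or thousands of dimensions), purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy dense embeddings; a real embedding model would produce these vectors.
query = [0.9, 0.1, 0.3]
doc_relevant = [0.8, 0.2, 0.4]
doc_unrelated = [0.1, 0.9, 0.0]

# The semantically closer document scores higher even with no shared words.
assert cosine_similarity(query, doc_relevant) > cosine_similarity(query, doc_unrelated)

def term_overlap(query_terms, doc_terms):
    """Sparse-style scoring: count exact term matches (a crude BM25 stand-in)."""
    return len(set(query_terms) & set(doc_terms))
```

Hybrid models combine both signals, which is why the evaluation factors in the article (dimensionality, tokenization unit) matter for each side differently.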
This article discusses how Shopify has transformed its product taxonomy management from manual processes to an AI-driven multi-agent system. The new system addresses challenges such as scaling taxonomy, maintaining consistency, and ensuring quality through automated analysis and domain expertise. By integrating real product data and employing specialized AI agents, Shopify can proactively adapt its taxonomy to meet evolving merchant and customer needs. The article highlights the efficiency gains and quality improvements achieved through this innovative approach.
This article discusses how to visualize data lineage in Amazon SageMaker Catalog, integrating various AWS analytics services like AWS Glue, Amazon EMR, and Amazon Redshift. It provides a step-by-step guide on configuring resources and implementing data lineage tracking to enhance data governance and quality.
The article discusses how The New York Times has evolved its subscription strategy using real-time algorithms and causal machine learning to optimize its digital subscription funnel. It highlights the transition from static paywalls to dynamic decision-making processes that consider various business KPIs. The implementation of real-time algorithms allows for tailored user experiences based on engagement and conversion metrics, ultimately enhancing subscription and registration rates. The article emphasizes the importance of collaboration between data science and business leadership in defining objectives and constraints.
Is Lightning Engine open source?
Good features for an LLM API; in Azure you have the equivalent: https://learn.microsoft.com/en-us/azure/api-management/genai-gateway-capabilities
This Data Processing MCP Server can fully manage the EMR, Athena, and Glue services. You really don't need to write code anymore...
I didn't know about the "OpenLineage standard" before; I'd guess DataHub should be able to support it as well.
Even though “more scaling” was the headline driver, the deeper motivation was to build a log-storage system that could operate reliably, efficiently, and autonomously at hyperscale, solving Kafka's metadata, coordination, and rebalancing challenges.
I like this statement: "We're not just building technology; we're building expertise." :)
This is exactly what I need: "It takes the candidate list from your existing search or retrieval system and re-orders it based on deep semantic understanding". But how is the performance for domain-specific queries?
I still don't fully see the vision of LH 2.0...
This is very interesting for my use case. It talks about some features I really need, like “Agent Card”, "long-running tasks", etc.
Introduces using an LLM as a “proxy defense”: an additional component that acts as a firewall to filter out unsafe user utterances.
Data enrichment is a very good design pattern in recommendation systems.
Interested to know, when using an Amazon Bedrock knowledge base over a Redshift database, how to open up access for apps with natural language.
Apache RocketMQ: enhancing RAG data timeliness, dynamic prompt data updates, and full-chain data quality tracking.
Instead of maintaining separate connectors for each data source, developers can now build against a standard protocol.
An EMR showcase, but capturing the EMR events with Lambda is interesting.
Agreed, DE will focus more on enabling data value rather than just building pipelines. But eventually we will need fewer people, I think.
For the title recommendation, I think the semantic-based approach is more efficient than the user-behaviour-based one.
The validation and remediation are interesting.
Interesting to see that all the input and output commands go through Kafka topic events.
Snowflake or Databricks? You need to make your choice ;)
Identifying PII fields and storing them separately is interesting; would we need to join them back later for any operations on them?
Treat time series data as a language to be modeled by off-the-shelf transformer architectures.
A modern data platform architecture based on the Databricks tech stack.
A very classic MDS.
Sounds very fast, but if we work with big datasets, how do we handle the data transformation in memory? If we work with small data, we can rewrite it into Parquet format and performance is not an issue.
I read the blog but wasn’t fully convinced by its main argument. In my view, Medallion Architecture is just one way to manage data, and it doesn’t necessarily require physically moving or copying data between different stages. Simply tagging tables should be sufficient. Different stages can enforce distinct archival, retention policies, and operational processes. Additionally, from a high-level perspective, the concept of data products doesn’t fundamentally contradict Medallion Architecture.
Interesting architecture to handle bursty and unpredictable traffic on AWS
My understanding of "Table Virtualization" is that it shares the tables between two data platforms.
A really good summary of the main tech products across the different categories of the data industry!
A very classic Glue job pipeline to feed Amazon Bedrock Knowledge Bases for a RAG use case.
Some good uses of GCP Gemini in your data engineering tasks, but I'm concerned about my GCP bill now ^^.
A summary of all the concepts and technologies needed to build a production-ready RAG solution.
Very nice! Uber runs Ray instances inside Spark executors. This setup allows each Spark task to spawn Ray workers for parallel computation, which boosts performance significantly.
"AI-centric" data processing focuses on preparing and managing large-scale, multimodal datasets efficiently for AI model training, fine-tuning, and deployment, rather than traditional database queries. It involves optimizing computation across heterogeneous resources (CPUs/GPUs), improving data flow efficiency, and enabling scalability—all crucial for building next-generation AI models.
The evaluator-optimizer workflow is interesting.
Running local-mode Spark in Kubernetes pods to process small files as they arrive; this is more efficient than running a big Spark cluster to process a huge number of files in batch.
Explains, across the three systems (API, data warehouse, AI inference), how to efficiently collect and validate the lineage metadata.
Monzo Bank optimized their data retention strategy in Amazon Keyspaces by replacing the traditional Time to Live (TTL) approach with a bulk deletion mechanism. By partitioning time-series data across multiple tables, each representing a specific time bucket, they can efficiently drop entire tables of expired data. This method significantly reduces operational costs associated with per-row TTL deletions.
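A minimal sketch of the table-per-time-bucket idea: route each event to a bucket table by timestamp, then drop whole tables once their entire bucket falls outside the retention window. The bucket width, epoch, and naming scheme below are illustrative assumptions, not Monzo's actual values.

```python
from datetime import datetime, timedelta, timezone

BUCKET_DAYS = 7  # hypothetical bucket width; the real system may differ
EPOCH = datetime(2020, 1, 1, tzinfo=timezone.utc)

def bucket_table_name(ts: datetime) -> str:
    """Map an event timestamp to its time-bucket table name.

    Zero-padding makes the names sort chronologically, so string
    comparison doubles as time comparison.
    """
    bucket = (ts - EPOCH).days // BUCKET_DAYS
    return f"events_bucket_{bucket:05d}"

def expired_tables(existing: list[str], now: datetime, retention_days: int) -> list[str]:
    """Tables whose whole bucket predates the retention cutoff can be dropped.

    Any bucket strictly before the cutoff's bucket ends at or before the
    cutoff instant, so every row in it is expired; no per-row TTL needed.
    """
    cutoff = bucket_table_name(now - timedelta(days=retention_days))
    return [t for t in existing if t < cutoff]
```

Dropping a table is one metadata operation, versus millions of per-row TTL tombstones, which is where the cost saving comes from.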
The State Reader API enables users to access and analyze Structured Streaming's internal state data. Readers will learn how to leverage the new features to debug, troubleshoot, and analyze state changes efficiently, making streaming workloads easier to manage at scale.
JD.com has developed a comprehensive big data governance framework to manage its extensive data infrastructure, which includes thousands of servers, exabytes of storage, and millions of data models and tasks. The governance strategy focuses on cost reduction, stability, security, and data quality. Key initiatives involve the implementation of audit logs, full-link data lineage, and automated governance platforms. These efforts aim to enhance data management efficiency, ensure data security, and optimize resource utilization across the organization.
Interesting sharing about the project and system design stage.
A good platform built for FinOps.
A good summary of topics for MLOps.
Automates granting S3 data access to the data consumers.
A good summary of the current problems with using Iceberg; the new S3 Tables look like they address all of these pain points.
A very good blog with a lot of tips for building a modern LLM RAG app.
By leveraging Iceberg tables, data is partitioned and stored in a way that aligns with the join keys, enabling highly efficient joins with minimal data movement for Spark jobs.
S3 table buckets handle the Iceberg compaction and catalog maintenance tasks for you.
Twitch has leveraged Views in their Data Lake to enhance data agility, minimize downtime, and streamline development workflows. By utilizing Views as interfaces to underlying data tables, they've enabled seamless schema modifications, such as column renames and VARCHAR resizing, without necessitating data reprocessing. This approach has facilitated rapid responses to data quality issues and supported efficient ETL processes, contributing to a scalable and adaptable data infrastructure.
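A small sketch of the view-as-interface idea: the view pins stable column names over the physical table, so a physical rename or resize only requires re-creating the view, not reprocessing data. The table and column names here are hypothetical, not Twitch's.

```python
def build_view_sql(view_name: str, table_name: str, column_map: dict) -> str:
    """Generate a view exposing a physical table under stable column names.

    column_map: {exposed_name: physical_name}. After a physical column
    rename, only this mapping changes and the view is re-created;
    downstream consumers keep querying the same exposed names.
    """
    select_list = ",\n  ".join(
        f"{physical} AS {exposed}" for exposed, physical in column_map.items()
    )
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n  {select_list}\nFROM {table_name};"
    )

# Hypothetical example: physical column 'uid' was renamed from 'user_id',
# but consumers of the view still see 'user_id'.
sql = build_view_sql(
    "analytics.users_v",
    "raw.users",
    {"user_id": "uid", "email": "email_addr"},
)
```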
Improving data processing efficiency by implementing Apache Iceberg's base-2 file layout for S3.
Having an agent break down question answering into multiple steps, using different tools for different types of questions and interacting with multiple data sources, is a good practice.
Without Iceberg, there is a lot of overhead work in implementing the WAP (write-audit-publish) pattern.
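For reference, the WAP pattern itself fits in a few lines. This is an in-memory toy to show the control flow; with Iceberg, the publish step becomes a cheap snapshot or branch swap instead of copying data, which is exactly the overhead it removes.

```python
class WriteAuditPublish:
    """Toy write-audit-publish: data lands in staging, audits run against
    it, and only a passing batch is atomically made visible to consumers."""

    def __init__(self):
        self.published = []   # what downstream consumers see
        self.staging = None   # candidate batch, invisible to consumers

    def write(self, rows):
        """Write: land the new batch in the staging area."""
        self.staging = rows

    def audit(self, checks):
        """Audit: run every quality check against the staged batch."""
        return all(check(self.staging) for check in checks)

    def publish(self, checks):
        """Publish: atomically swap staging in, only if audits pass."""
        if not self.audit(checks):
            raise ValueError("audit failed; staging data not published")
        self.published = self.staging
        self.staging = None
```

Usage: a failed audit leaves the previously published data untouched, so consumers never see the bad batch.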
Grab's Data Engineering and Data Governance teams collaborated to automate metadata generation and sensitive data identification using Large Language Models (LLMs). This initiative aimed to enhance data discovery and streamline access management across the organization.
Grab's Data Engineering and Data Governance teams enhanced their Large Language Model (LLM) integration to automate metadata generation and data classification. Post-rollout improvements focused on refining model accuracy, reducing manual verification, and increasing scalability across the data lake.
Builds a process to assemble the complete data lineage by automatically merging the partial lineage generated by dbt.
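A minimal sketch of what merging partial lineage fragments might look like: each fragment maps a node to its known upstreams, and the merge unions the edges into one graph. Plain dicts stand in for dbt's manifest output here; node names are illustrative.

```python
def merge_lineage(fragments):
    """Merge partial lineage fragments into one complete edge map.

    fragments: iterable of {node: [upstream, ...]} dicts, e.g. one per
    dbt project. Edges for the same node are unioned across fragments.
    """
    merged = {}
    for fragment in fragments:
        for node, upstreams in fragment.items():
            merged.setdefault(node, set()).update(upstreams)
    return {node: sorted(ups) for node, ups in merged.items()}

def upstream_closure(lineage, node):
    """All transitive upstream dependencies of a node (iterative DFS)."""
    seen, stack = set(), [node]
    while stack:
        for up in lineage.get(stack.pop(), []):
            if up not in seen:
                seen.add(up)
                stack.append(up)
    return seen
```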
With AWS Lake Formation, you can create filter packages and control access. A filter package provides a restricted view of a data asset by defining column and row filters on the tables.
To find the user profiles to remove, an AWS Lambda function queries Aurora, DynamoDB, and Athena and records those locations in a DynamoDB table dedicated to GDPR requests.
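A hedged sketch of the kind of item such a Lambda might write to the GDPR-requests table. The field names and status value are assumptions for illustration, not the article's actual schema.

```python
def record_pii_locations(user_id, scan_results):
    """Build the erasure-request item for one user.

    scan_results: [(store, location), ...] pairs from scanning each
    backend, e.g. ("aurora", "users.profile:42") or
    ("dynamodb", "sessions/user-123"). An erasure worker would later
    read this item and delete each listed location.
    """
    return {
        "user_id": user_id,
        "locations": [{"store": s, "location": loc} for s, loc in scan_results],
        "status": "PENDING_DELETION",  # hypothetical workflow state
    }
```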
Netflix's Distributed Counter Abstraction is a scalable service designed to handle high-throughput counting operations with low latency. It supports two primary counter types: Best-Effort, which offers near-immediate access with potential slight inaccuracies, and Eventually Consistent, which ensures accurate counts with minimal delays. This abstraction is built atop Netflix's TimeSeries Abstraction and is managed via the Data Gateway Control Plane, allowing for flexible configuration and global deployment.
Netflix's TimeSeries Data Abstraction Layer is designed to efficiently store and query vast amounts of temporal event data with low millisecond latency. It addresses challenges such as high throughput, efficient querying of large datasets, global read and write operations, tunable configurations, handling bursty traffic, and cost efficiency. The abstraction integrates with storage backends like Apache Cassandra and Elasticsearch, offering flexibility and scalability to support Netflix's diverse use cases.
I think the challenge is still adoption of this kind of system; the integration and workflow changes are a huge cost for the stakeholder teams.
58 Group optimized its data integration platform using Apache SeaTunnel to handle over 500 billion daily data messages efficiently. This effort addressed challenges such as high reliability, throughput, low latency, and simplified maintenance. By evolving from Kafka Connect to SeaTunnel, the architecture now supports diverse data sources, enhanced task management, and real-time monitoring, with future plans to leverage AI for diagnostics and transition to cloud environments.
A very flexible and scalable recommendation solution.
Calculates new features for user segmentation, with good sharing on the validation.
A classic RAG solution for this kind of application.
Our team had a similar problem: "Despite this integration’s technical success, we soon noticed that the new system was delivering lower-than-expected user engagement." Nice rethinking about the improvements built!
Why do we load the S3 data into Redshift again? Isn't it already queryable via Redshift Spectrum? I guess it's for performance? Transform the raw S3 data, build the data models, and write back into S3?
This is one of the big differences between DE and SE.
A good "dataset" feature, available since 2.4.0, released on September 19, 2022.
Fully adapting the SDLC practices to the analytics world...
This paper provides an overview of the Data Vault concept and the business benefits of leveraging it on the cloud-based enterprise database BigQuery.
Comprehensive explanation of how Alluxio accelerates data access in the cloud.
This is mostly a Netflix-scale problem; it takes huge engineering work to build this KV data abstraction layer.
I agree with the 'dark data' problem in large organizations, and tools like Dataplex can help by automating data discovery. However, with thousands of tables generated, it raises the question: who will sift through these massive results to identify truly valuable datasets? This process could be very time-consuming.
Using LLM RAG to fetch the right dataset, combined with automatically enhanced explanations and analysis for users, is really a good idea.
This article discusses an approach similar to the raw, curated, and delivery zones we've talked about before. The key concept is to process and manage data in distinct zones or stages to support data governance and optimize data usage. Most data teams will likely need to adopt some version of this architecture to efficiently handle and control large volumes of data assets.
QuintoAndar's DAG Builder allows scalable management of 10,000+ Apache Airflow DAGs by using YAML configurations to generate DAGs, minimizing code duplication and standardizing data pipeline creation. By separating DAG structures from workflow-specific parameters, QuintoAndar enables data engineers to create new pipelines through declarative YAML files, streamlining the process and ensuring quality across pipelines. This system improves team productivity, simplifies code maintenance, and reduces the learning curve for new team members.
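A minimal sketch of the config-driven generation idea: declarative configs (as loaded from YAML files) become DAG specs through one shared builder. The field names ('team', 'pipeline', 'schedule', 'tasks') are assumptions, since QuintoAndar's actual schema isn't public; in Airflow, each returned spec would be materialized into a real DAG object registered in the module's globals.

```python
def build_dag_specs(configs):
    """Turn declarative pipeline configs into DAG specs.

    Each cfg dict stands in for one parsed YAML file. Separating the
    generic DAG structure (here) from workflow-specific parameters
    (in the configs) is what keeps thousands of pipelines uniform.
    """
    specs = []
    for cfg in configs:
        specs.append({
            "dag_id": f"{cfg['team']}__{cfg['pipeline']}",
            "schedule": cfg.get("schedule", "@daily"),  # assumed default
            "tasks": [t["name"] for t in cfg.get("tasks", [])],
        })
    return specs

# Hypothetical config, as it might look after yaml.safe_load():
configs = [{
    "team": "growth",
    "pipeline": "daily_signups",
    "tasks": [{"name": "extract"}, {"name": "load"}],
}]
specs = build_dag_specs(configs)
```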
This guide demonstrates integrating Gretel with BigQuery DataFrames for synthetic data generation. By leveraging BigQuery's pandas-compatible APIs and Gretel's machine learning tools, users can generate and de-identify high-quality synthetic data that maintains data privacy and regulatory compliance. The process includes data de-identification with Gretel's Transform v2 and synthetic data generation with Gretel Navigator Fine Tuning, optimized for handling patient records with complex data relationships.
AWS introduces a visual designer in SageMaker Pipelines to simplify fine-tuning and deploying Llama 3.x models. This new UI allows users to create, manage, and automate workflows for continuous model updates using a no-code interface. The article details a sample pipeline for customizing LLMs with SEC financial data, enabling tasks like model evaluation, deployment, and conditional registration based on performance.
Netflix's Data Gateway platform abstracts the complexities of distributed databases, providing scalable, secure, and reliable data access layers (DAL) through standardized gRPC and HTTP APIs.
Pinterest's journey of adopting Ray for infrastructure enhancement started in 2023. It involved overcoming challenges like Kubernetes integration, optimizing resource utilization, and ensuring security. The Ray infrastructure enables scalable, efficient machine learning workloads, significantly improving last-mile data processing, batch inference, and recommender systems model training. By focusing on distributed processing, cost management, and developer velocity, Pinterest achieved improved scalability and operational efficiency for its machine learning applications.
Pinterest enhances machine learning dataset iteration speed by adopting Ray for distributed processing, addressing bottlenecks in dataset handling for recommender systems. Previously slow processes involving Apache Spark and Airflow workflows now leverage Ray's parallelization, resulting in a significant reduction in training time. Ray’s support for CPU/GPU resource management and streaming execution has led to increased throughput and cost savings, improving ML engineer velocity and overall efficiency in managing large-scale data.
Adevinta's Central Product and Tech department has implemented a data mesh architecture to manage and deliver data products across its marketplaces. The initiative emphasizes domain-specific datasets, SQL accessibility, and datasets as products, with a focus on improving decision-making for data analysts, scientists, and product managers. Key strategies include centralized governance, domain-oriented data, and the establishment of working agreements to ensure data quality and alignment across decentralized teams.
This article discusses how Netflix enhances long-term member satisfaction through personalized recommendations. By moving beyond traditional metrics like clicks or CTR, Netflix uses reward engineering to optimize for long-term satisfaction. The process involves defining proxy rewards based on user interactions, predicting delayed feedback, and aligning recommendations with long-term engagement. Challenges include dealing with delayed feedback, the disparity between online and offline metrics, and refining proxy rewards to better align with long-term satisfaction.
The article discusses Spotify's approach to creating and managing high-quality dashboards at scale. Spotify utilizes Tableau and Looker Studio as primary tools, supported by a Dashboard Quality Framework that ensures consistency and trust in the dashboards. The framework includes automatic checks ('Vital Signs') and a manual design checklist ('Spicy Dashboard Design'). The Dashboard Portal centralizes dashboard access, offering search, curation, and quality labeling features, enhancing the overall accessibility and reliability of dashboards across the company.
This article discusses Airbnb's development of a comprehensive Data Protection Platform (DPP) to address challenges in data security and privacy compliance. The platform integrates various services like Madoka for metadata management, Inspekt for data classification, and Cipher for encryption. It highlights the need for automated data protection due to the complexity of handling sensitive data across different environments and the importance of complying with global regulations like GDPR and CCPA.
It's a good high-level summary, but I think each team still needs to run a spike to find the tools that suit their use case and project.
This article talks about the idea behind fal dbt, extending dbt's capabilities on the Airflow platform. It also covers a lot of other popular tools for Airflow.
I think Dagster has zoomed in from a job-level view to an asset/table-level view of the pipelines. There are always pros and cons.
It's a good showcase blog for GCP, but it would be very interesting to see more detail about how Fluidly's data team leveraged GCP to launch their new data-driven business products.
The highlight is configuring data profiling jobs without a single line of code or SQL, with a good UI to check the job output.
The article discusses the importance of treating data as a product at Netflix, emphasizing a product mindset that focuses on intentional design, clear ownership, and continuous evaluation to enhance data utility and trust. It outlines key principles for data products, including clear purpose, defined users, and lifecycle management.
The article discusses Agoda's decision to migrate their Feature Store Serving system from a JVM-based stack to Rust, driven by performance and reliability challenges. It details the migration process, including the proof of concept, performance benchmarks, and the importance of shadow testing to ensure correctness. The transition resulted in significant efficiency gains, handling five times more traffic while drastically reducing CPU and memory usage, leading to substantial cost savings. The article emphasizes the role of AI tools and the Rust compiler in facilitating the team's adoption of Rust despite their initial lack of experience.
This article provides an overview of four primary methods for evaluating large language models (LLMs): multiple-choice benchmarks, verifiers, leaderboards, and LLM judges. It discusses the advantages and limitations of each method, emphasizing the importance of understanding these evaluation techniques for better interpreting model performance. The article also includes code examples for implementing these evaluation methods from scratch, making it a valuable resource for practitioners in the field of LLM development and evaluation.
This article discusses the challenges and solutions for building end-to-end data lineage in enterprise data analytics, particularly for one-time and complex queries. It highlights the use of Amazon Athena, Amazon Redshift, Amazon Neptune, and dbt to create a unified data modeling language across different platforms. The authors explain how to automate the data lineage generation process using AWS services like Lambda and Step Functions, ensuring accuracy and scalability. The article provides insights into the architecture and implementation details necessary for effective data lineage tracking.
The article discusses the evolving landscape of databases as they adapt to meet the needs of emerging AI agents. It highlights four innovative initiatives: AgentDB, which treats databases as disposable files; 'Postgres for Agents,' enhancing PostgreSQL for agent use; Databricks Lakebase, merging transactional and analytical capabilities; and Bauplan Labs, focusing on safety and reliability in data operations. These initiatives reflect a broader trend of reimagining database functionality to better serve machine users in an agent-native world.
The article discusses the Chain-of-Draft (CoD) prompting technique, which offers a more efficient alternative to the traditional Chain-of-Thought (CoT) method for large language models. CoD reduces verbosity and improves cost efficiency and response times by limiting reasoning steps to five words or less. The authors demonstrate the implementation of CoD using Amazon Bedrock and AWS Lambda, showcasing significant reductions in token usage and latency while maintaining accuracy. The article emphasizes the practical benefits of CoD for organizations scaling their generative AI implementations.
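For illustration, the core of CoD is just a different instruction wrapped around the question; here is a sketch (the instruction wording paraphrases the Chain-of-Draft idea and is not necessarily the article's exact prompt).

```python
# Chain-of-Draft style instruction: keep each reasoning step to a tiny
# draft instead of verbose Chain-of-Thought prose, cutting output tokens.
COD_INSTRUCTION = (
    "Think step by step, but keep only a minimum draft for each thinking "
    "step, with five words at most. Return the final answer after ####."
)

def build_cod_prompt(question: str) -> str:
    """Wrap a user question with the CoD instruction, ready to send to a model."""
    return f"{COD_INSTRUCTION}\n\nQ: {question}\nA:"

prompt = build_cod_prompt("What is 17 + 25?")
```

In the article's setup this prompt would be sent via Amazon Bedrock from a Lambda handler; the savings come entirely from the shorter drafts the model emits.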
The article outlines Halodoc's comprehensive approach to data validation within a Lakehouse architecture, emphasizing the importance of data accuracy and reliability. It describes a multi-layered validation strategy that employs AI to enhance data quality checks at various stages of the data pipeline. The validation layers include checks for data consistency, structural correctness, business correctness, and reconciliation, ensuring that data remains trustworthy throughout its journey. The implementation of this strategy has led to reduced data incidents and increased trust among analytics and product teams.