Tech Articles

Filter articles by tags or search for specific topics:

How I Structure My Data Pipelines

tech1
March 22, 2026 Databricks Unity Catalog AWS Database Migration Service dbt

The article discusses the author's approach to structuring data pipelines by integrating the medallion architecture, Kimball dimensional modeling, and semantic layers. It emphasizes the importance of defining clear roles and outputs for each layer—Bronze, Silver, and Gold—to cater to different user needs. The author argues for making the semantic layer a first-class priority in data architecture, highlighting its role in providing governed metrics for self-service analytics. The article concludes with a concrete example of how marketing attribution data flows through this architecture.

Evaluating AI agents: Real-world lessons from building agentic systems at Amazon

project
March 22, 2026 AWS Bedrock

The article discusses the evolution of generative AI applications into agentic AI systems at Amazon, highlighting the need for a comprehensive evaluation framework. It emphasizes the importance of assessing not just individual model performance but also the emergent behaviors of the entire system. The authors present a detailed evaluation methodology that includes automated workflows and a library of metrics tailored for agentic AI applications. Best practices and lessons learned from real-world implementations are shared to guide developers in evaluating and deploying these complex systems effectively.

Scaling PostgreSQL to power 800 million ChatGPT users

project
March 19, 2026 Apache Pinot Postgres DB

The article discusses how OpenAI has successfully scaled PostgreSQL to handle the demands of 800 million ChatGPT users, achieving millions of queries per second. It outlines the challenges faced during high write traffic, the optimizations implemented, and the architectural decisions made to maintain performance and reliability. Key strategies include offloading read traffic, optimizing queries, and managing workloads to prevent service degradation. The article also highlights the importance of connection pooling and caching to enhance database efficiency.

Memory: How Agents Learn

tutorial
March 18, 2026 Data Prep Kit Ray Data

The article discusses the importance of memory in AI agents, particularly how it enables them to learn from past interactions and improve their performance over time. It categorizes memory into three types: session memory, user memory, and learned memory, each with distinct characteristics and benefits. The author provides code examples for implementing these memory types in agents, emphasizing the significance of learned memory for enhancing agent capabilities. The article concludes with a discussion on what constitutes a good learning and the need for human oversight in the learning process.

Building a conversational agent in BigQuery using the Conversational Analytics API

tutorial
March 17, 2026 GCP BigQuery

This article provides a comprehensive guide on building a conversational agent in BigQuery using the Conversational Analytics API. It outlines the steps to configure the agent, create conversations, and manage interactions with users. The API enables users to query BigQuery data using natural language, facilitating real-time insights and dynamic reporting. The article emphasizes the importance of clear system instructions and schema descriptions to enhance the agent's effectiveness.

Demystifying evals for AI agents

tech1
March 17, 2026

The article discusses the complexities of evaluating AI agents, emphasizing the importance of rigorous evaluations (evals) throughout the agent lifecycle. It outlines various evaluation structures, types of graders, and the significance of early and continuous eval development. The piece highlights the challenges faced by teams without evals, which can lead to reactive development cycles. It also provides insights into different agent types and their evaluation techniques, ultimately advocating for a systematic approach to agent evaluation to enhance performance and reliability.

The state of data mesh in 2026: From hype to hard-won maturity

vision
March 16, 2026

The article discusses the evolution of data mesh from a concept filled with hype to a mature socio-technical paradigm by 2026. It emphasizes the challenges organizations face in implementing data mesh, particularly in changing organizational behaviors and aligning data initiatives with business strategies. The authors share insights on the four core principles of data mesh: domain ownership, treating data as a product, self-serve data platforms, and federated computational governance. The article concludes that successful data mesh implementations require a long-term commitment to organizational transformation rather than merely adopting new technologies.

Optimizing Flink’s join operations on Amazon EMR with Alluxio

tech1
March 15, 2026 Alluxio Apache Flink

The article discusses the challenges of correlating real-time data with historical data in data analysis, particularly in e-commerce scenarios. It presents an optimized solution using Apache Flink to join streaming order data with historical customer and product information, leveraging Alluxio for caching. The implementation details include using Hive dimension tables and Flink's temporal joins to enhance performance and reduce bottlenecks. The article also addresses state management issues in Flink applications and provides insights into improving data processing efficiency.

Reimagining LinkedIn’s Search Tech Stack

tech2
March 14, 2026 Apache Arrow Apache Spark

The article discusses LinkedIn's transformation of its search technology stack, focusing on the integration of large language models (LLMs) to enhance search experiences. It details the challenges and innovations involved in deploying LLMs at scale, including query understanding, semantic retrieval, and ranking processes. The use of AI-driven job and people search features aims to provide more relevant and personalized results. Additionally, the article highlights the importance of continuous relevance measurement and quality evaluation in maintaining a high-quality search experience.

Unified Context-Intent Embeddings for Scalable Text-to-SQL

project
March 13, 2026 AWS OpenSearch Apache Airflow Apache Spark

This article discusses Pinterest's evolution from basic Text-to-SQL systems to a sophisticated Analytics Agent that leverages unified context-intent embeddings for improved SQL generation and table discovery. The system addresses the challenges of understanding analytical intent and provides a structured approach to data governance and documentation. By encoding historical query patterns and utilizing AI-generated documentation, the agent enhances the efficiency and reliability of data analytics at Pinterest. The article outlines the architecture and operational principles behind the agent's design, emphasizing the importance of context and governance in AI-driven analytics.

The next evolution of Delta - Catalog-Managed Tables

tech1
March 8, 2026 Databricks Unity Catalog Delta Lake

The article discusses the introduction of catalog-managed tables in Delta Lake 4.1.0, which shift the management of table access and metadata from the filesystem to a catalog-centric model. This change aims to simplify table discovery, enhance governance, and improve performance by allowing clients to reference tables by name rather than by path. The article also highlights the challenges faced with filesystem-managed tables and how catalog-managed tables address these issues, paving the way for a more interoperable and efficient data ecosystem.

Engineering LinkedIn's Job Ingestion System at Scale

project
March 7, 2026

The article discusses the architecture and challenges of LinkedIn's job ingestion system, which processes millions of job postings daily from diverse sources. It highlights the importance of reliability, scalability, and extensibility in handling heterogeneous job data feeds. The system employs a modular, event-driven pipeline that includes job intake and processing stages, utilizing various methods for data extraction and transformation. The article emphasizes the need for robust security protocols and maintaining data quality to ensure a trustworthy job catalog for users.

Your Agents Need Runbooks, Not Bigger Context Windows

tech1
March 7, 2026

The article discusses the necessity of operational memory in AI agents, particularly in high-stakes environments where reliability is crucial. It critiques the current reliance on large context windows and highlights the inefficiencies of existing memory architectures. The author proposes a Context File System (CFS) that separates reasoning from execution, allowing agents to build a library of proven procedures. This shift aims to enhance automation and reduce costs in enterprise settings.

Zero-ETL Integrations with Amazon OpenSearch Service

product
March 7, 2026 AWS OpenSearch AWS DynamoDB

The article discusses the capabilities of Amazon OpenSearch Service, focusing on its zero-ETL integrations with various AWS services. It highlights how these integrations simplify data access and analysis by eliminating the need for complex ETL pipelines. The article covers specific integrations with services such as Amazon S3, CloudWatch, DynamoDB, RDS, Aurora, and DocumentDB, detailing their features, benefits, and best practices. Overall, it emphasizes the operational efficiency and innovation acceleration that zero-ETL integrations can provide for real-time analytics and search applications.

Unified Data Discovery with Business Context in Unity Catalog

product
March 7, 2026 Databricks Unity Catalog

The article discusses the challenges organizations face in finding and verifying data across analytics and AI workflows. It introduces Databricks' new Discover experience, which integrates business context and trust into the Unity Catalog, allowing users to find and access trusted data and AI assets more efficiently. The article highlights the importance of domains, intelligent curation, and governed access in facilitating a unified discovery experience that enhances user confidence and reduces bottlenecks in data access.

DataJunction as Netflix’s answer to the missing piece of the modern data stack

project
March 5, 2026 Data Junction

The article discusses Netflix's challenges in managing metrics within its experimentation platform and how DataJunction, an open-source metric platform, addresses these issues. It highlights the importance of a centralized semantic layer for defining metrics and dimensions, which simplifies the onboarding process for data scientists and analytics engineers. The authors detail the architecture and design decisions behind DataJunction, emphasizing its SQL parsing capabilities and integration with existing tools. The article concludes with plans for further integration and unification of analytics at Netflix.

Under the hood: an introduction to the Native Execution Engine for Microsoft Fabric

tech1
March 4, 2026 Apache Spark

The article introduces the Native Execution Engine for Microsoft Fabric, designed to enhance Apache Spark's performance without requiring code changes. It explains the challenges faced by traditional Spark execution due to increasing data volumes and real-time processing demands. The Native Execution Engine leverages C++ and vectorized execution to optimize Spark workloads, particularly for columnar data formats like Parquet and Delta Lake. The integration of open-source technologies Velox and Apache Gluten is highlighted, showcasing significant performance improvements and cost savings for users.

Balancing Cost and Reliability for Spark on Kubernetes

project
March 4, 2026 Apache Spark Apache Paimon

The article discusses the development and implementation of Spot Balancer, a tool created by Notion in collaboration with AWS, which optimizes the use of Spark on Kubernetes by balancing cost and reliability. It highlights the challenges faced when using Spot Instances for Spark jobs and how Spot Balancer allows for better control over executor placement to prevent job failures. The article outlines the transition from Amazon EMR to EMR on EKS and the benefits of dynamic provisioning and efficient resource management. Ultimately, the tool has helped Notion reduce Spark compute costs by 60-90% without sacrificing reliability.

Our Multi-Agent Architecture for Smarter Advertising

project
February 28, 2026 Apache Pinot dbt Apache Spark Ray Data

The article discusses Spotify's innovative multi-agent architecture designed to enhance its advertising platform. By addressing the fragmented decision-making processes across various advertising channels, the architecture aims to unify workflows and optimize campaign management through specialized AI agents. This approach allows for more efficient budget allocation, audience targeting, and overall campaign performance, leveraging historical data and machine learning. The article highlights the importance of a programmable decision layer and the challenges faced in implementing this system.

AI Agent-Driven Browser Automation for Enterprise Workflow Management

tech2
January 22, 2026 AWS Bedrock AWS EventBridge

The article discusses the challenges enterprises face with manual workflows across multiple web applications and introduces AI agent-driven browser automation as a solution. It highlights how AI agents can intelligently navigate complex workflows, reduce manual intervention, and improve operational efficiency. The article provides a detailed example of an e-commerce order management system that utilizes Amazon Bedrock and AI agents for automating order processing across various retailer websites. It emphasizes the importance of human oversight in handling exceptions and maintaining compliance.

Orchestrating Success

project
January 21, 2026 Apache Airflow dbt

The article discusses Vinted's journey in standardizing large-scale decentralized data pipelines as they migrated their data infrastructure to the cloud. Initially, teams operated independently, but as dependencies grew, coordination became challenging. To address this, they developed a DAG generator that abstracts pipeline creation and standardizes dependency interactions, allowing teams to focus on data models rather than orchestration details. This approach improved visibility and reduced operational complexity across decentralized teams.

Building a multi-agent pipeline for NL-to-SQL analytics

project
January 21, 2026

This article discusses the development of a multi-agent architecture for a natural language to SQL (NL-to-SQL) analytics system. It highlights the limitations of a monolithic MCP-based system and presents a new A2A (agent-to-agent) pipeline that improves stability, scalability, and error isolation. The article details the roles of specialized agents in the pipeline and how they collaborate to convert user queries into accurate SQL. Additionally, it emphasizes the importance of a structured data mart for efficient analytics execution.

Build a generative AI-powered business reporting solution with Amazon Bedrock

tutorial
January 16, 2026 AWS Bedrock AWS DynamoDB

This article presents a solution for automating business reporting using generative AI and Amazon Bedrock. It highlights the inefficiencies of traditional reporting processes and introduces a serverless architecture that leverages AWS services to streamline report writing and enhance internal communication. The solution includes a user-friendly interface for associates and managers, enabling efficient report generation and submission. Additionally, it addresses challenges such as data management and risk mitigation associated with AI implementation.

How Slack achieved operational excellence for Spark on Amazon EMR using generative AI

tech1
January 15, 2026 AWS Bedrock Apache Spark AWS EventBridge Ray Data

The article discusses how Slack developed a comprehensive metrics framework to enhance the performance and cost-efficiency of their Apache Spark jobs on Amazon EMR. By integrating generative AI and custom monitoring tools, they achieved significant improvements in job completion times and cost reductions. The framework captures over 40 metrics, providing granular insights into application behavior and resource usage. The article outlines the architecture of their monitoring solution and the benefits of AI-assisted tuning for Spark operations.

Introducing Visa Intelligent Commerce on AWS: Enabling agentic commerce with Amazon Bedrock AgentCore

project
January 15, 2026 AWS Bedrock Apache Paimon

The article discusses the collaboration between AWS and Visa to introduce Visa Intelligent Commerce, which leverages Amazon Bedrock AgentCore to enable agentic commerce. This new approach allows for seamless, autonomous payment experiences that reduce manual intervention in transactions. The article explains how intelligent agents can handle multi-step tasks in various sectors, particularly in payments and shopping, transforming traditional workflows into more efficient, outcome-driven processes. It also highlights the technical architecture and tools involved in building these agentic workflows.

Data Engineering in 2026: What Changes?

vision
January 14, 2026 Ray Data PromptWizard Data Prep Kit

The article discusses the evolving landscape of data engineering as it adapts to the needs of AI agents in an increasingly automated environment. It emphasizes the importance of building reliable, code-first data platforms that can handle multimodal data and provide context for agents. The shift from traditional data engineering tasks to high-level system supervision is highlighted, along with the necessity for safety and correctness in data pipelines. Ultimately, the article envisions a future where humans and AI agents collaborate seamlessly, transforming data engineering practices.

State of Large Language Models in 2025

vision
January 13, 2026

The article reflects on the significant advancements in large language models (LLMs) throughout 2025, highlighting key developments such as reasoning models, reinforcement learning with verifiable rewards (RLVR), and the GRPO algorithm. It discusses the evolving landscape of LLM architectures, the importance of inference scaling, and the challenges of benchmarking in the field. The author shares predictions for future trends in LLM development, emphasizing the need for continual learning and domain specialization. Overall, it provides a comprehensive overview of the state of LLMs and their implications for various industries.

Why We Use Separate Tech Stacks for Personalization and Experimentation

vision
January 8, 2026

The article discusses the importance of separating tech stacks for personalization and experimentation at Spotify. It explains how personalized applications enhance user experiences by tailoring content to individual preferences using advanced machine learning models. The distinction between personalization and experimentation is highlighted, emphasizing the need for different infrastructures and methodologies for each. The article also outlines the benefits of this separation in terms of scalability and efficiency in evaluating recommendation systems.

Unifying Governance and Metadata Across Amazon SageMaker Unified Studio and Atlan

tech1
January 4, 2026 Atlan AWS Sagemaker

This article discusses the integration of Atlan and Amazon SageMaker Unified Studio to unify governance and metadata management across data and AI environments. It highlights the importance of maintaining consistent metadata in hybrid environments where different teams use various tools. The article provides a detailed overview of the integration process, including setting up secure connections and automated synchronization of metadata. It emphasizes the benefits of having a single, trusted view of data assets for both business and technical users.

Agentic AI: Engineering Reliability in Operational Systems

vision
December 29, 2025

At QCon AI NYC 2025, Aaron Erickson presented agentic AI as an engineering challenge rather than a simple prompt crafting task. He emphasized the importance of combining probabilistic components with deterministic boundaries to enhance reliability. The article discusses the role of specialized agents and deterministic tools in operational systems, highlighting the need for structured outputs and effective tool selection. Erickson's insights provide a framework for understanding the complexities of deploying AI in real-world applications.

Unlocking Entertainment Intelligence with Knowledge Graph

tech2
November 27, 2025

The article discusses Netflix's implementation of an Entertainment Knowledge Graph, which unifies disparate entertainment datasets into a cohesive ecosystem. This ontology-driven architecture enhances analytics, machine learning, and strategic decision-making by providing semantic connectivity and conceptual consistency. It addresses challenges in traditional data management by allowing rapid integration of new data types and relationships, ultimately improving insights into the entertainment landscape. The article outlines the architecture, use cases, and future outlook of the knowledge graph.

How Care Access achieved 86% data processing cost reductions and 66% faster data processing with Amazon Bedrock prompt caching

project
November 25, 2025 AWS Bedrock AWS S3

The article discusses how Care Access, a healthcare organization, utilized Amazon Bedrock's prompt caching feature to significantly reduce data processing costs by 86% and improve processing speed by 66%. By caching static medical record content while varying analysis questions, Care Access optimized their operations to handle large volumes of medical records efficiently while maintaining compliance with healthcare regulations. The implementation details, including the architecture and security measures, are also highlighted, showcasing the transformative impact of this technology on their health screening program.

650GB of Data (Delta Lake on S3). Polars vs DuckDB vs Daft vs Spark.

tech1
November 24, 2025 Apache Spark DuckDB DuckDB

The article discusses the challenges of processing large datasets using single-node frameworks like Polars, DuckDB, and Daft compared to traditional Spark clusters. It highlights the concept of 'cluster fatigue' and the emotional and financial costs associated with running distributed systems. The author conducts a performance comparison of these frameworks on a 650GB dataset stored in Delta Lake on S3, demonstrating that single-node frameworks can effectively handle large datasets without the need for extensive resources. The findings suggest that modern Lake House architectures can benefit from these lightweight alternatives.

Understanding Spark on YARN: Resource Management and Communication

tech2
November 23, 2025 Apache Spark

This article explores how Apache Spark interacts with YARN for resource management in a cluster environment. It details the roles of YARN's components: Resource Manager, Application Master, and Node Manager, and explains the communication process during Spark application execution. The author discusses common exceptions encountered when running Spark on YARN, emphasizing the importance of understanding these interactions for effective troubleshooting. The article serves as a guide for advanced users looking to optimize Spark applications on YARN.

Simple Queries in Spark Catalyst Optimisation (1)

tech2
November 23, 2025 Apache Spark

This article explores the performance benefits of using Spark SQL's Catalyst optimizer, particularly focusing on DataFrame transformations. It discusses the four stages of Catalyst optimization, emphasizing the Physical Plan stage and how caching DataFrames can significantly improve query performance. The author provides insights into the execution plans generated by Spark and the implications of using UnsafeRow for memory management. Ultimately, the article concludes that while simple queries may not benefit from Catalyst optimization without caching, performance can be enhanced when DataFrames are cached.

Simple Queries in Spark Catalyst Optimisation (2) Join and Aggregation

tech2
November 23, 2025 Apache Spark

This article explores the join and aggregation operations in Spark's Catalyst optimization engine. It discusses how Spark generates execution plans for these operations, including SortMergeJoin and HashAggregate, and the underlying mechanisms that ensure efficient data processing. The author highlights the complexities of data shuffling and the importance of distribution and ordering in Spark plans. Overall, the article provides insights into the optimization strategies employed by Spark Catalyst for handling join and aggregation queries.

Introducing Contextual Retrieval

tech1
November 23, 2025

The article introduces a new method called Contextual Retrieval, which enhances the traditional Retrieval-Augmented Generation (RAG) approach by improving the retrieval step through two sub-techniques: Contextual Embeddings and Contextual BM25. This method significantly reduces retrieval failures and improves the accuracy of AI models in specific contexts. The article also discusses the importance of context in information retrieval and provides insights into implementing this technique using Claude, along with considerations for performance optimization and cost reduction.

The 'Think' Tool: Enabling Claude to Stop and Think in Complex Tool Use Situations

tech1
November 23, 2025

This article discusses the introduction of the 'think' tool for Claude, which enhances its ability to solve complex problems by allowing it to pause and reflect during multi-step tasks. Unlike the 'extended thinking' capability, the 'think' tool focuses on processing new information and ensuring compliance with policies. The article provides practical guidance for implementing the tool, backed by performance evaluations showing significant improvements in customer service scenarios. It emphasizes the importance of strategic prompting and offers best practices for effective use.

Effective Context Engineering for AI Agents

tech1
November 23, 2025

The article discusses the emerging field of context engineering for AI agents, emphasizing the importance of managing context as a finite resource. It contrasts context engineering with prompt engineering, highlighting the need for strategies that optimize the utility of tokens during LLM inference. The article explores techniques such as compaction, structured note-taking, and sub-agent architectures to maintain coherence over long-horizon tasks. It concludes by stressing the significance of thoughtful context curation to enhance agent performance.

Building a reactive Fraud Prevention Platform

project
November 22, 2025

The article discusses the redesign of Monzo's Fraud Prevention Platform, emphasizing the challenges of detecting and preventing fraud in a fast-paced environment. It outlines the complexities of fraud detection, including the sophistication and speed of fraudsters, and the need for a balance between user experience and security. The system design is explained in detail, highlighting the use of machine learning models and a microservices architecture to monitor and respond to fraudulent activities effectively.

Iceberg REST Catalog Now Supported in BigLake Metastore for Open Data Interoperability

product
November 20, 2025 Apache Iceberg Apache Spark GCP BigQuery GCP BigQuery metastore

Google Cloud has announced the general availability of Iceberg REST Catalog support in BigLake metastore, enhancing open data interoperability across various data engines. This fully-managed, serverless metastore allows users to query data using their preferred engines, including Apache Spark and BigQuery, without the need for data duplication. The integration with Dataplex Universal Catalog provides comprehensive governance and lineage capabilities. Organizations like Spotify are already leveraging this technology to build modern lakehouse platforms.

What it means to get your data ready for AI

vision
November 19, 2025

The article discusses the evolving role of data engineers in the context of Agentic AI, highlighting how the interaction with data is shifting from a builder-centric model to a more user-driven approach. It emphasizes the importance of rethinking traditional ETL/ELT processes, prioritizing data curation over mere collection, and building infrastructure that supports AI agents. The author outlines five key principles for data engineers to adapt to these changes, focusing on context-aware data handling and the management of AI-generated artifacts.

LyftLearn Evolution: Rethinking ML Platform Architecture

project
November 18, 2025 AWS Sagemaker

The article discusses Lyft's transition from a fully Kubernetes-based machine learning platform to a hybrid architecture utilizing AWS SageMaker for offline workloads and Kubernetes for online model serving. It highlights the challenges faced with the original architecture, including operational complexity and resource management, and details the technical decisions made to simplify the infrastructure while maintaining performance. The migration aimed to reduce operational overhead and improve reliability, allowing teams to focus on developing new capabilities rather than managing infrastructure.

Accelerating generative AI applications with a platform engineering approach

product
November 18, 2025 AWS Bedrock AWS DynamoDB Apache DataFusion

The article discusses how organizations can leverage platform engineering principles to accelerate the development and deployment of generative AI applications. It highlights the challenges faced by organizations in experimenting with generative AI and emphasizes the importance of building reusable components to manage costs and improve efficiency. The article outlines the architecture of generative AI applications, including the integration of various data layers and the role of large language models. It also covers best practices for observability, orchestration, and governance in AI workflows.

Accelerate enterprise solutions with agentic AI-powered consulting: Introducing AWS Professional Service Agents

vision
November 17, 2025 AWS Bedrock AWS Database Migration Service

The article introduces AWS Professional Services' new approach to consulting, leveraging agentic AI to enhance cloud adoption and digital transformation for organizations. It highlights the role of specialized AI agents in streamlining consulting processes, improving solution quality, and reducing project timelines. The integration of AI with human expertise is emphasized as a means to deliver better customer outcomes. Real-world examples, including the NFL's use of AWS agents, illustrate the tangible benefits of this innovative consulting model.

Reduce CAPTCHAs for AI agents browsing the web with Web Bot Auth (Preview) in Amazon Bedrock AgentCore Browser

product
November 16, 2025 AWS Bedrock

The article discusses the challenges AI agents face when browsing the web, particularly with CAPTCHAs and other bot detection mechanisms. It introduces Amazon Bedrock AgentCore Browser's new feature, Web Bot Auth, which provides AI agents with verifiable cryptographic identities to reduce CAPTCHA friction. The article explains how this protocol works and its collaboration with WAF providers to ensure secure access for verified bots. It highlights the benefits for both AI agents and website owners in managing automated traffic.

Analyzing Amazon EC2 Spot instance interruptions by using event-driven architecture

tutorial
November 16, 2025 AWS EventBridge AWS OpenSearch

This article discusses the challenges associated with using Amazon EC2 Spot Instances within Auto Scaling Groups due to their unpredictable interruptions. It presents a solution in the form of a custom event-driven monitoring and analytics dashboard, named 'Spot Interruption Insights', which provides near real-time visibility into Spot Instance interruptions. The article outlines a step-by-step guide to building this monitoring solution using various AWS services, including Amazon EventBridge, SQS, Lambda, and OpenSearch Service, to optimize capacity planning and improve workload resilience.

Getting AI to write good SQL: Text-to-SQL techniques explained

product
November 15, 2025 GCP BigQuery GCP BigQuery metastore Databricks Unity Catalog

The article discusses the advancements in text-to-SQL capabilities using Google's Gemini models, which allow users to generate SQL queries from natural language prompts. It highlights the challenges faced in understanding user intent, providing business-specific context, and the limitations of large language models in generating precise SQL. Various techniques to improve text-to-SQL performance are explored, including intelligent retrieval of data, disambiguation methods, and validation processes. The article serves as an introduction to a series on enhancing text-to-SQL solutions within Google Cloud products.

How Yelp modernized its data infrastructure with a streaming lakehouse on AWS

project
November 13, 2025 Apache Paimon AWS Database Migration Service AWS S3

The article discusses Yelp's transformation of its data infrastructure through the adoption of a streaming lakehouse architecture on AWS. This modernization aimed to address challenges related to data processing latency, operational complexity, and compliance with regulations like GDPR. By migrating from self-managed Apache Kafka to Amazon MSK and implementing Apache Paimon for storage, Yelp achieved significant improvements, reducing analytics data latencies from 18 hours to minutes and cutting storage costs by over 80%. The article outlines the architectural shifts and technologies involved in this transformation.

Building Zone Failure Resilience in Apache Pinot™ at Uber

tech2
November 10, 2025 Apache Pinot

The article discusses Uber's implementation of zone failure resilience (ZFR) in Apache Pinot, a real-time analytics platform. It details the strategies used to ensure that Pinot can withstand zone failures without impacting queries or data ingestion. By leveraging instance assignment capabilities and integrating with Uber's isolation groups, the article outlines how they achieved a robust deployment model that enhances operational efficiency and reliability. The migration process for existing clusters to this new setup is also highlighted, showcasing the challenges and solutions involved.

How Ericsson achieves data integrity and superior governance with Dataplex

product
November 9, 2025 GCP Dataplex

The article discusses Ericsson's transformative journey towards data governance using Google Cloud's Dataplex Universal Catalog. It highlights the importance of data integrity and governance in modern telecommunications, particularly for Ericsson's Managed Services. The piece outlines the steps taken by Ericsson to operationalize its data strategy, emphasizing the need for clean, reliable data and the balance between compliance and innovation. It also touches on future priorities in AI-powered data governance and the lessons learned from their experience.

How Confluent Is Rebuilding Data Infrastructure for the Age of AI Agents

product
November 9, 2025

The article discusses how Confluent is evolving its data infrastructure to accommodate the demands of AI agents, emphasizing the need for real-time data processing capabilities. Key features introduced include Confluent Intelligence, a controlled stack for AI agent development, and the Real-Time Context Engine, which aims to provide timely data delivery. The article highlights the importance of integrating streaming data with AI systems to enhance decision-making and operational efficiency. It concludes by noting that the future of AI will depend on the robustness of the underlying data systems.

Supercharging the ML and AI Development Experience at Netflix with Metaflow

tech2
November 6, 2025 Metaflow

The article discusses the enhancements made to Metaflow, a framework for managing machine learning and AI workflows at Netflix. It introduces a new feature called Spin, which allows for rapid, iterative development similar to using notebook cells, enabling developers to quickly test and debug individual steps in their workflows. The article emphasizes the importance of state management and the differences between traditional software engineering and ML/AI development. It also highlights how Metaflow integrates with other tools to streamline the deployment process.

The Architectural Shift: AI Agents Become Execution Engines While Backends Retreat to Governance

vision
November 6, 2025

The article discusses a significant transformation in enterprise software architecture where AI agents evolve from assistive tools to operational execution engines. This shift is characterized by traditional backends transitioning to governance roles, particularly in sectors like banking, healthcare, and retail. The adoption of protocols like the Model Context Protocol (MCP) allows AI agents to directly invoke services and orchestrate workflows, leading to increased efficiency and autonomy in enterprise applications. Predictions indicate that by 2026, 40% of enterprise applications will incorporate such autonomous agents.

Identify User Journeys at Pinterest

tech2
November 5, 2025

The article discusses how Pinterest has developed a system to identify user journeys, which are sequences of user-item interactions that reveal user interests, intents, and contexts. By leveraging user data and machine learning techniques, Pinterest aims to enhance its recommendation system, moving beyond immediate interests to long-term user goals. The approach includes dynamic keyword extraction, clustering, and journey ranking, which collectively improve user engagement through personalized notifications. The article outlines the system architecture, key components, and the impact of journey-aware notifications on user interactions.

Designing AI-Driven User Memory for Personalization

tech2
November 4, 2025

The article discusses Zillow's innovative approach to personalization in the real estate market through AI-driven user memory. It emphasizes the importance of understanding user preferences and adapting to their evolving needs over time. The article outlines how Zillow combines batch and real-time data processing to create a dynamic user memory that enhances the home shopping experience. Key components include recency and frequency of user interactions, flexibility in preferences, and predictive modeling to anticipate user needs.

Exploring the Data Engineering Agent in BigQuery

product
November 3, 2025 GCP BigQuery GCP Dataplex

The article announces the preview of the Data Engineering Agent in BigQuery, designed to automate complex data engineering tasks. It highlights how the agent can streamline pipeline development, maintenance, and troubleshooting, allowing data professionals to focus on higher-level tasks. Key features include natural language pipeline creation, intelligent modifications, and integration with Dataplex for enhanced data governance. The article also shares positive feedback from early users, emphasizing the agent's potential to transform data engineering workflows. I would like to see how this help the data modelling tasks, usually DE team don't own this part or need cross ownership this with DA.

Requirement Adherence: Boosting Data Labeling Quality Using LLMs

tech2
October 28, 2025

The article discusses Uber AI Solutions' innovative in-tool quality-checking framework, Requirement Adherence, which enhances data labeling quality using large language models (LLMs). It outlines the challenges of traditional labeling workflows and presents a two-step approach involving rule extraction and in-tool validation. By leveraging LLMs, the framework efficiently identifies labeling errors in real-time, significantly reducing rework and costs for enterprise clients. The article emphasizes the importance of maintaining data privacy and the continuous improvement of the system through feedback mechanisms.

Post-Training Generative Recommenders with Advantage-Weighted Supervised Finetuning

tech2
October 27, 2025

This article discusses the challenges and advancements in post-training generative recommender systems, particularly focusing on a novel algorithm called Advantage-Weighted Supervised Fine-tuning (A-SFT). The authors highlight the limitations of traditional reinforcement learning methods in recommendation contexts, such as the lack of counterfactual observations and noisy reward models. A-SFT aims to improve recommendation quality by effectively combining supervised fine-tuning with reinforcement learning techniques. The results demonstrate that A-SFT outperforms existing methods in aligning generative models with user preferences.

From Word2Vec to LLM2Vec: How to Choose the Right Embedding Model for RAG

tech1
October 14, 2025

This article provides a comprehensive guide on selecting the appropriate embedding model for Retrieval-Augmented Generation (RAG) systems. It discusses the importance of embedding models in converting human language into machine-readable vectors and evaluates various types of embedding models, including sparse, dense, and hybrid models. Key factors for evaluating these models are outlined, such as context window, tokenization unit, dimensionality, and training data. The article concludes by emphasizing the need for practical testing with real-world data to ensure effective implementation.

Beyond classification: How AI agents are evolving Shopify's product taxonomy at scale

project
October 14, 2025

This article discusses how Shopify has transformed its product taxonomy management from manual processes to an AI-driven multi-agent system. The new system addresses challenges such as scaling taxonomy, maintaining consistency, and ensuring quality through automated analysis and domain expertise. By integrating real product data and employing specialized AI agents, Shopify can proactively adapt its taxonomy to meet evolving merchant and customer needs. The article highlights the efficiency gains and quality improvements achieved through this innovative approach.

Visualize data lineage using Amazon SageMaker Catalog for Amazon EMR, AWS Glue, and Amazon Redshift

tutorial
October 13, 2025

This article discusses how to visualize data lineage in Amazon SageMaker Catalog, integrating various AWS analytics services like AWS Glue, Amazon EMR, and Amazon Redshift. It provides a step-by-step guide on configuring resources and implementing data lineage tracking to enhance data governance and quality.

Scaling Subscriptions at The New York Times with Real-Time Causal Machine Learning

tech2
October 13, 2025

The article discusses how The New York Times has evolved its subscription strategy using real-time algorithms and causal machine learning to optimize its digital subscription funnel. It highlights the transition from static paywalls to dynamic decision-making processes that consider various business KPIs. The implementation of real-time algorithms allows for tailored user experiences based on engagement and conversion metrics, ultimately enhancing subscription and registration rates. The article emphasizes the importance of collaboration between data science and business leadership in defining objectives and constraints.

The Ultimate Guide to LLM Evaluation: Metrics, Methods & Best Practices

tech1
September 25, 2025

Amazon Strands Agents SDK: A technical deep dive into agent architectures and observability

product
August 5, 2025

2025 Top 10 Risk & Mitigations for LLMs and Gen AI Apps

vision
August 1, 2025

The Dataproc advantage: Advanced Spark features that will transform your analytics and AI

product
August 1, 2025 GCP Dataproc Apache Spark

Lightning Engine is open source?

Secure Your AI APIs with Apigee & Model Armor

product
July 29, 2025

Good features for the LLM API, in Azure you have the equivalent: https://learn.microsoft.com/en-us/azure/api-management/genai-gateway-capabilities

Accelerating development with the AWS Data Processing MCP Server and Agent

project
July 24, 2025 AWS Glue AWS Athena

This Data Processing MCP Server can fully manage all the EMR, Athena and Glue services. You really don't need code any more...

Capture data lineage from dbt, Apache Airflow, and Apache Spark with Amazon SageMaker

product
July 16, 2025 Apache Spark AWS Datazone AWS Sagemaker Apache Airflow dbt

i don't know "OpenLineage standard" before, I guess Datahub should enable to support it as well.

Introducing Northguard and Xinfra: scalable log storage at LinkedIn

project
July 16, 2025

Even “more scaling” was the headline driver, the deeper motivation was to build a log‐storage system that could operate reliably, efficiently, and autonomously at hyperscale—solving Kafka’s metadata, coordination, and rebalancing challenges.

Advancing Enterprise AI: How Wix is Democratizing RAG Evaluation

project
July 16, 2025

Direct Data Sharing using Delta Sharing - Introduction: Our Journey to Empower Partners at Zalando

project
July 15, 2025 Delta Sharing Databricks Unity Catalog

i like this statement "We're not just building technology; we're building expertise." :)

Scaling recommendations service at OLX

project
July 14, 2025 ScyllaDB

Boost your Search and RAG agents with Vertex AI's new state-of-the-art Ranking API

product
July 13, 2025

this is exactly what i need "It takes the candidate list from your existing search or retrieval system and re-orders it based on deep semantic understanding", but how's the performance for the domain specific queries?

How good is your AI? Gen AI evaluation at every stage, explained

product
July 13, 2025

Lakehouse 2.0: The Open System That Lakehouse 1.0 Was Meant to Be | Part 1

vision
July 11, 2025

Still not fully see the vision of LH 2.0 ...

Taming complexity: An intuitive evaluation framework for agentic chatbots in business-critical environments

tech1
July 10, 2025

Announcing the Agent2Agent Protocol (A2A)

product
July 7, 2025

This is very interesting for my use case. It talk about some features I really needed like “Agent Card”, "long-running tasks" etc.

Refining input guardrails for safer LLM applications

project
July 5, 2025

Introduce using LLM as a “proxy defense,” which implements an additional component acting as a firewall to filter out unsafe user utterances.

Unlocking the Power of Customization: How Our Enrichment System Transforms Recommendation Data Enrichments

project
July 3, 2025

Data enrichment is a very good design pattern in the Recommendation system.

Empower financial analytics by creating structured knowledge bases using Amazon Bedrock and Amazon Redshift

product
May 29, 2025 AWS Redshift AWS Bedrock

Interested to know when an Amazon Bedrock knowledge base for the Redshift database, how to open the access for Apps with nature language.

Development Trends and Open Source Technology Practices of AI Agents

product
April 6, 2025

- Apache RocketMQ: Enhancing RAG Data Timeliness - Dynamic Prompt Data Updates - Full-Chain Data Quality Tracking

​The Model Context Protocol (MCP)

product
April 2, 2025

Instead of maintaining separate connectors for each data source, developers can now build against a standard protocol.

A Deep Dive into Agoda's Generic Reconciliation Platform

spike
March 31, 2025

How Amazon optimized its high-volume financial reconciliation process with Amazon EMR for higher scalability and performance

project
March 28, 2025 AWS EventBridge

EMR showcase, but capture the EMR events into Lambda is interesting.

Journey of next generation control plane for data systems

project
March 27, 2025

How AI will disrupt data engineering as we know it

vision
March 25, 2025

Agreed, DE will more focus on enabling the data value rather just build pipelines. But eventually we will need less human resources i think.

Foundation Model for Personalized Recommendation

project
March 24, 2025

For the recommendation of the title, i think the semantic based is more efficient than the user behaviour based.

How to Systematically Improve RAG Applications

vision
March 21, 2025

Gartner Data Governance Maturity Model: What It Is, How It Works

vision
March 20, 2025

Schema Change Management at Halodoc

spike
March 18, 2025 AWS Glue AWS S3

Expressive Time Travel and Data Validation for Financial Workloads

product
March 17, 2025 Apache Iceberg AWS Database Migration Service

The validation and remediation are interesting.

Lithium: elevating ETL with ephemeral and self-hosted pipelines

project
March 14, 2025 Kafka Connectors

Interesting to see all the commands of input and output are by the Kafka topics events.

From Snowflake to Databricks: Our cost-effective journey to a unified data warehouse

project
March 13, 2025 Snowflake Databricks SQL Looker Studio

Snowflake or Databricks? You need make your choice ;)

Architecting Compliance: Cost-Effective Data Strategies for GDPR

project
March 12, 2025

PII fields are identified and separated stored is interesting, we would need to join later for any operations on it?

Creating source-aligned data products in Adevinta Spain

project
March 11, 2025

Time series forecasting with LLM-based foundation models and scalable AIOps on AWS

poc
March 10, 2025 AWS Sagemaker

Treat time series data as a language to be modeled by off-the-shelf transformer architectures.

Over 700 million events/second: How we make sense of too much data

project
March 7, 2025

Title Launch Observability at Netflix Scale - Part 3: System Strategies and Architecture

project
March 6, 2025

One Big Table (OBT)

tech1
March 5, 2025

From Lakehouse architecture to data mesh

project
March 4, 2025 Databricks Unity Catalog

modern data platform architecture based on Databrick tech stack.

Design patterns for implementing Hive Metastore for Amazon EMR on EKS

poc
March 3, 2025 Apache Hive Apache Spark

Data Pipelines Architecture at BlaBlaCar

project
February 28, 2025 Apache Airflow GCP Dataflow GCP BigQuery

Very classic MDS

Redefining Data Engineering with Go and Apache Arrow

poc
February 27, 2025 Apache Arrow DuckDB

Sounds very first, but if we work with big dataset, how to handle the data transformation in the memory? If we work with small data, we can rewrite into Parquet format and the performance is not an issue.

Data Products: A Case Against Medallion Architecture

vision
February 26, 2025

I read the blog but wasn’t fully convinced by its main argument. In my view, Medallion Architecture is just one way to manage data, and it doesn’t necessarily require physically moving or copying data between different stages. Simply tagging tables should be sufficient. Different stages can enforce distinct archival, retention policies, and operational processes. Additionally, from a high-level perspective, the concept of data products doesn’t fundamentally contradict Medallion Architecture.

WellRight modernizes to an event-driven architecture to manage bursty and unpredictable traffic

project
February 25, 2025 AWS EventBridge AWS DynamoDB

Interesting architecture to handle bursty and unpredictable traffic on AWS

Towards composable data platforms

vision
February 25, 2025

My understanding of "Table Virtualization" is share the tables between two data platforms.

Open Source Data Engineering Landscape 2025

vision
February 24, 2025

Real good summary for the main tech products in the different categories of data industry!

The Unstructured Data Landscape

vision
February 23, 2025

The Evolution of Business Intelligence

vision
February 21, 2025

How Formula 1® uses generative AI to accelerate race-day issue resolution

project
February 20, 2025 AWS Glue AWS S3 AWS Bedrock AWS EventBridge

Very classic Glue job pipeline to feed the AWS Bedrock Knowledge Bases for a RAG use case.

How to use gen AI for better data schema handling, data quality, and data generation

product
February 19, 2025 GCP BigQuery

Some good usage of GCP gemini in your data engineering tasks, but I'm concern about my bill of GCP now ^^.

Scaling Large Language Models for e-Commerce: The Development of a Llama-Based Customized LLM

project
February 18, 2025

Introducing Impressions at Netflix (part 1)

project
February 17, 2025 Apache Flink Apache Iceberg

The Art of Secure Search: How Wix Mastered PII Data in Vespa Search Engine

project
February 16, 2025 Vespa

Improving Recruiting Efficiency with a Hybrid Bulk Data Processing Framework

project
February 15, 2025

Zenml vs flyte vs metaflow

tech1
February 14, 2025 ZenML Flyte Metaflow

From concept to reality: Navigating the Journey of RAG from proof of concept to production

vision
February 13, 2025 AWS Bedrock

summarise of all the concept and technologies to build a production ready RAG solution.

How Uber Uses Ray® to Optimize the Rides Business

project
February 7, 2025 Apache Spark Ray

very nice! Uber runs Ray instances inside Spark executors. This setup allows each Spark task to spawn Ray workers for parallel computation, which boosts performance significantly.

The foundations of Canva’s continuous data platform with Snowpipe Streaming

project
February 6, 2025 Snowflake

Paradigm Shifts in Data Processing for the Generative AI Era

vision
February 5, 2025

"AI-centric" data processing focuses on preparing and managing large-scale, multimodal datasets efficiently for AI model training, fine-tuning, and deployment, rather than traditional database queries. It involves optimizing computation across heterogeneous resources (CPUs/GPUs), improving data flow efficiency, and enabling scalability—all crucial for building next-generation AI models.

How the Apache Arrow Format Accelerates Query Result Transfer

product
February 3, 2025 Apache Arrow

Building effective agents

vision
January 31, 2025

Workflow: Evaluator-optimizer is interesting.

How Nielsen uses serverless concepts on Amazon EKS for big data processing with Spark workloads

project
January 30, 2025 Apache Spark

Running local mode Spark cluster in k8 pods to processing the small files coming, this mode is more efficient than running big Spark cluster to process huge amount files in batch.

Automated GenAI-driven search quality evaluation

project
January 29, 2025

How Meta discovers data flows via lineage at scale

project
January 28, 2025

Explained in the three systems (API, data warehouse, AI inference), how to efficiently collect and validate the Lineage metadata.

How Monzo Bank reduced cost of TTL from time series index tables in Amazon Keyspaces

project
January 27, 2025 Apache Cassandra

Monzo Bank optimized their data retention strategy in Amazon Keyspaces by replacing the traditional Time to Live (TTL) approach with a bulk deletion mechanism. By partitioning time-series data across multiple tables, each representing a specific time bucket, they can efficiently drop entire tables of expired data. This method significantly reduces operational costs associated with per-row TTL deletions.

Introducing Easier Change Data Capture in Apache Spark™ Structured Streaming

project
January 27, 2025 Apache Spark

The State Reader API enables users to access and analyze Structured Streaming's internal state data. Readers will learn how to leverage the new features to debug, troubleshoot, and analyze state changes efficiently, making streaming workloads easier to manage at scale.

Introducing Configurable Metaflow

product
January 26, 2025 Metaflow

Why Fluss? Top 4 Challenges of Using Kafka for Real-Time Analytics

product
January 25, 2025

Introducing Fluss: Streaming Storage for Real-Time Analytics

product
January 25, 2025 Fluss

JD.com's Exploration and Practice of Big Data Governance (CN)

project
January 24, 2025 Apache Spark Apache Hive

JD.com has developed a comprehensive big data governance framework to manage its extensive data infrastructure, which includes thousands of servers, exabytes of storage, and millions of data models and tasks. The governance strategy focuses on cost reduction, stability, security, and data quality. Key initiatives involve the implementation of audit logs, full-link data lineage, and automated governance platforms. These efforts aim to enhance data management efficiency, ensure data security, and optimize resource utilization across the organization.

An introduction to preparing your own dataset for LLM training

product
January 22, 2025

Introducing Onehouse Compute Runtime to Accelerate Lakehouse Workloads Across All Engines

product
January 21, 2025 Onehouse Apache Spark Apache Hudi

Part 3: A Survey of Analytics Engineering Work at Netflix

project
January 19, 2025

Part 2: A Survey of Analytics Engineering Work at Netflix

project
January 18, 2025

Part 1: A Survey of Analytics Engineering Work at Netflix

project
January 17, 2025 Data Junction

How EUROGATE established a data mesh architecture using Amazon DataZone

project
January 16, 2025 AWS Datazone

How TUI uses Amazon Bedrock to scale content creation and enhance hotel descriptions in under 10 seconds

project
January 15, 2025 AWS Bedrock AWS Sagemaker

Title Launch Observability at Netflix Scale - Understanding The Challenges

project
January 13, 2025

Title Launch Observability at Netflix Scale - Navigating Ambiguity

project
January 13, 2025

Interesting sharing about the project and system design stage.

Cloud Efficiency at Netflix

project
January 12, 2025

Good platform built for Finops

Jumia builds a next-generation data platform with metadata-driven specification frameworks

vision
January 11, 2025 Apache Iceberg Apache Airflow AWS Glue

Improving Search Ranking for Maps

project
January 10, 2025

Building a User Signals Platform at Airbnb

project
January 10, 2025 Apache Flink

A Journey Towards Unified Data Governance at bp

project
January 9, 2025

MLOps Best Practices - MLOps Gym: Crawl

vision
January 7, 2025 MLflow

Good summary and topics for ML ops.

From RAG to fabric: Lessons learned from building real-world RAGs at GenAIIC

product
January 6, 2025 AWS Bedrock

Introducing the Prompt Engineering Toolkit

project
January 5, 2025

Behind the platform: the journey to create the LinkedIn GenAI application tech stack

project
January 4, 2025

How Zalando optimized large-scale inference and streamlined ML operations on Amazon SageMaker

project
January 3, 2025 AWS Sagemaker

Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

project
January 2, 2025 AWS Datazone

Automated granted the S3 data access to the Data consumers.

Apache Iceberg: The Hadoop of the Modern Data Stack?

vision
December 31, 2024 Apache Iceberg

Good summarise the current problem for using Iceberg system, but the new S3 Table looks addressing all these pain points.

DARWIN: Data Science and Artificial Intelligence Workbench at LinkedIn

project
December 30, 2024

Mastering Airflow DAG Standardization with Python’s AST: A Deep Dive into Linting at Scale

project
December 29, 2024 Apache Airflow

Practical text-to-SQL for data analytics

project
December 28, 2024

Very good sharing blog, a lot of tips to build a modern LLM RAG app.

Selecting a model for semantic search at Dropbox scale

project
December 27, 2024

Turbocharging Efficiency & Slashing Costs: Mastering Spark & Iceberg Joins with Storage-Partitioned

spike
December 26, 2024 Apache Spark Apache Iceberg

Leverage of Iceberg table, Data is partitioned and stored in a way that aligns with the join keys, enabling highly efficient joins with minimal data movement for Spark job.

A First Look at S3 (Iceberg) Tables

tech1
December 25, 2024 AWS Glue AWS S3 Apache Iceberg

S3 Table bucket handle the Iceberg compaction and catalog maintenance tasks for you.

Views pwn Tables as data interfaces

project
December 24, 2024 AWS Glue AWS Redshift AWS S3 AWS Athena

Twitch has leveraged Views in their Data Lake to enhance data agility, minimize downtime, and streamline development workflows. By utilizing Views as interfaces to underlying data tables, they've enabled seamless schema modifications, such as column renames and VARCHAR resizing, without necessitating data reprocessing. This approach has facilitated rapid responses to data quality issues and supported efficient ETL processes, contributing to a scalable and adaptable data infrastructure.

Amazon Q data integration adds DataFrame support and in-prompt context-aware job creation

product
December 23, 2024 AWS Glue AWS Sagemaker

Our journey to Snowflake monitoring mastery

project
December 22, 2024 dbt Snowflake

Parquet pruning in DataFusion

tech1
December 21, 2024 Apache DataFusion

Design multi-agent orchestration with reasoning using Amazon Bedrock and open source frameworks

product
December 20, 2024

How Amazon Ads uses Iceberg optimizations to accelerate their Spark workload on Amazon S3

project
December 19, 2024 Apache Iceberg Apache Spark AWS S3

Improving the data processing efficiency by implementing Apache Iceberg's base-2 file layout for S3.

How Twitch used agentic workflow with RAG on Amazon Bedrock to supercharge ad sales

product
December 17, 2024 AWS Bedrock

An agent breaks down the process of answering questions into multiple steps, and uses different tools to answer different types of questions or interact with multiple data sources, is a good practise.

Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality

product
December 16, 2024 AWS Glue Apache Iceberg

Without Iceberg, there are lot of overhead works to implement WAP pattern.

LLM-powered data classification for data entities at scale

project
December 15, 2024

Grab's Data Engineering and Data Governance teams collaborated to automate metadata generation and sensitive data identification using Large Language Models (LLMs). This initiative aimed to enhance data discovery and streamline access management across the organization.

Metasense V2: Enhancing, improving and productionisation of LLM powered data governance

spike
December 15, 2024

Grab's Data Engineering and Data Governance teams enhanced their Large Language Model (LLM) integration to automate metadata generation and data classification. Post-rollout improvements focused on refining model accuracy, reducing manual verification, and increasing scalability across the data lake.

Building end-to-end data lineage for one-time and complex queries using Amazon Athena, Amazon Redshift, Amazon Neptune and dbt

product
December 13, 2024 AWS Glue AWS S3 AWS Neptune AWS Redshift AWS Athena dbt

Build a process to built the complete data lineage information by merging the partial lineage generated by dbt automatically.

Lucene: Uber’s Search Platform Version Upgrade

project
December 12, 2024 Apache Lucene

Implement historical record lookup and Slowly Changing Dimensions Type-2 using Apache Iceberg

spike
December 10, 2024 Apache Iceberg

BMW Cloud Data Hub: A reference implementation of the modern data architecture on AWS

project
December 9, 2024 AWS Glue AWS S3 AWS Athena

How BMW streamlined data access using AWS Lake Formation fine-grained access control

project
December 8, 2024 AWS Glue AWS DynamoDB AWS Athena AWS Lakeformation

With AWS LakeFormation, creating filter packages and controlling access. A filter package provides a restricted view of a data asset by defining column and row filters on the tables.

Building a GDPR compliance solution with Amazon DynamoDB

project
December 7, 2024 AWS Glue AWS S3 AWS DynamoDB

To search user profiles to remove, we use an AWS Lambda function that queries Aurora, DynamoDB, and Athena and places those locations in a DynamoDB table specifically for GDPR requests.

Automation Platform v2: Improving Conversational AI at Airbnb

project
December 5, 2024

Netflix’s Distributed Counter Abstraction

project
December 4, 2024 Apache Cassandra

Netflix's Distributed Counter Abstraction is a scalable service designed to handle high-throughput counting operations with low latency. It supports two primary counter types: Best-Effort, which offers near-immediate access with potential slight inaccuracies, and Eventually Consistent, which ensures accurate counts with minimal delays. This abstraction is built atop Netflix's TimeSeries Abstraction and is managed via the Data Gateway Control Plane, allowing for flexible configuration and global deployment.

Introducing Netflix’s TimeSeries Data Abstraction Layer

project
December 2, 2024 Apache Cassandra Elasticsearch

Netflix's TimeSeries Data Abstraction Layer is designed to efficiently store and query vast amounts of temporal event data with low millisecond latency. It addresses challenges such as high throughput, efficient querying of large datasets, global read and write operations, tunable configurations, handling bursty traffic, and cost efficiency. The abstraction integrates with storage backends like Apache Cassandra and Elasticsearch, offering flexibility and scalability to support Netflix's diverse use cases.

The infrastructure behind AI search in Figma

project
December 1, 2024 AWS OpenSearch

Governing the ML lifecycle at scale, Part 3: Setting up data governance at scale

product
November 28, 2024 AWS Datazone AWS Athena

I think the challenges part still is the adaption of this kind of system, integration and change the workflow is a huge cost for the stakeholder team.

58 Group optimized its data integration platform using Apache SeaTunnel (CN)

project
November 28, 2024 Apache SeaTunnel

58 Group optimized its data integration platform using Apache SeaTunnel to handle over 500 billion daily data messages efficiently. This effort addressed challenges such as high reliability, throughput, low latency, and simplified maintenance. By evolving from Kafka Connect to SeaTunnel, the architecture now supports diverse data sources, enhanced task management, and real-time monitoring, with future plans to leverage AI for diagnostics and transition to cloud environments.

Self-Serve Platform for Scalable ML Recommendations

project
November 26, 2024

Very flexible and scalable recommendation solution

From Data to Insights: Segmenting Airbnb’s Supply

project
November 26, 2024

Calculate new features for the user segmentation, and a good share for the validation.

Introducing generative AI troubleshooting for Apache Spark in AWS Glue (preview)

product
November 25, 2024 AWS Glue

Classic RAG solution for this kind of application.

A brief history of Notion’s data catalog

project
November 24, 2024 Datahub

Our team had the similar problem "Despite this integration’s technical success, we soon noticed that the new system was delivering lower-than-expected user engagement." Nice re-thinking about the improvement built!

Loading data into Redshift with DBT

project
November 23, 2024 dbt AWS Redshift

Why we load the S3 data into the Redshift again? it already queryable via Redshift Spectrum? I guess it's for the performance? Transform the S3 raw data, build the data models and write back into S3?

From dbt to SQLMesh

project
November 21, 2024 SqlMesh

4 Key Benefits of Shift Left

vision
November 20, 2024

This is one of big difference comparing to DE with SE.

Right-sizing Spark executor memory

project
November 19, 2024 Apache Spark

Presto® Express: Speeding up Query Processing with Minimal Resources

project
November 19, 2024 PrestoDB

Dynamic Data Pipelines with Airflow Datasets and Pub/Sub

product
November 18, 2024 Apache Airflow

Good "dataset" feature since 2.4.0, released on September 19, 2022.

The Analytics Development Lifecycle (ADLC)

vision
November 18, 2024 dbt

Fully adapt the SDLC practises for analytic world ...

Data Vault 2.0 on Google Cloud BigQuery

tech2
November 16, 2024 GCP BigQuery

This paper provides an overview of the Data Vault concept and the business benefits of leveraging it on the cloud-based enterprise database BigQuery.

Development and Comparative Analysis of Data Lake Storage Acceleration Solutions (CN)

product
November 15, 2024 Alluxio

Comprehensive explanation of how Alluxio accelerates data access in the cloud.

Introducing Netflix’s Key-Value Data Abstraction Layer

project
November 14, 2024 Apache Cassandra

This is mostly like a Netflix level's problem, huge engineering works to build this KV data abstract layer.

Dataplex Automatic Discovery makes Cloud Storage data available for Analytics and governance

product
November 12, 2024 GCP Dataplex

I agree with the 'dark data' problem in large organizations, and tools like Dataplex can help by automating data discovery. However, with thousands of tables generated, it raises the question: who will sift through these massive results to identify truly valuable datasets? This process could be very time-consuming.

Leveraging RAG-powered LLMs for Analytical Tasks

project
November 11, 2024

Using LLM RAG to fetch the right dataset and combined auto enhanced explanation and analysis for the users is really a good idea.

What goes into bronze, silver, and gold layers of a medallion data architecture?

vision
November 10, 2024

This article discusses an approach similar to the raw, curated, and delivery zones we've talked about before. The key concept is to process and manage data in distinct zones or stages to support data governance and optimize data usage. Most data teams will likely need to adopt some version of this architecture to efficiently handle and control large volumes of data assets.

Building data pipelines effortlessly with a DAG Builder for Apache Airflow

project
November 8, 2024 Apache Airflow

QuintoAndar's DAG Builder allows scalable management of 10,000+ Apache Airflow DAGs by using YAML configurations to generate DAGs, minimizing code duplication and standardizing data pipeline creation. By separating DAG structures from workflow-specific parameters, QuintoAndar enables data engineers to create new pipelines through declarative YAML files, streamlining the process and ensuring quality across pipelines. This system improves team productivity, simplifies code maintenance, and reduces the learning curve for new team members.

A practical guide to synthetic data generation with Gretel and BigQuery DataFrames

product
November 5, 2024

This guide demonstrates integrating Gretel with BigQuery DataFrames for synthetic data generation. By leveraging BigQuery's pandas-compatible APIs and Gretel's machine learning tools, users can generate and de-identify high-quality synthetic data that maintains data privacy and regulatory compliance. The process includes data de-identification with Gretel's Transform v2 and synthetic data generation with Gretel Navigator Fine Tuning, optimized for handling patient records with complex data relationships.

Automate fine-tuning of Llama 3.x models with the new visual designer for Amazon SageMaker Pipelines

product
October 28, 2024 AWS Sagemaker

AWS introduces a visual designer in SageMaker Pipelines to simplify fine-tuning and deploying Llama 3.x models. This new UI allows users to create, manage, and automate workflows for continuous model updates using a no-code interface. The article details a sample pipeline for customizing LLMs with SEC financial data, enabling tasks like model evaluation, deployment, and conditional registration based on performance.

Data Gateway — A Platform for Growing and Protecting the Data Tier

product
October 10, 2024 Apache Cassandra

Netflix's Data Gateway platform abstracts the complexities of distributed databases, providing scalable, secure, and reliable data access layers (DAL) through standardized gRPC and HTTP APIs.

Ray Infrastructure at Pinterest

project
September 7, 2024 Ray

Pinterest's journey of adopting Ray for infrastructure enhancement started in 2023. It involved overcoming challenges like Kubernetes integration, optimizing resource utilization, and ensuring security. The Ray infrastructure enables scalable, efficient machine learning workloads, significantly improving last-mile data processing, batch inference, and recommender systems model training. By focusing on distributed processing, cost management, and developer velocity, Pinterest achieved improved scalability and operational efficiency for its machine learning applications.

Last Mile Data Processing with Ray

project
September 4, 2024 Ray

Pinterest enhances machine learning dataset iteration speed by adopting Ray for distributed processing, addressing bottlenecks in dataset handling for recommender systems. Previously slow processes involving Apache Spark and Airflow workflows now leverage Ray's parallelization, resulting in a significant reduction in training time. Ray’s support for CPU/GPU resource management and streaming execution has led to increased throughput and cost savings, improving ML engineer velocity and overall efficiency in managing large-scale data.

Building a data mesh to support an ecosystem of data products at Adevinta

vision
September 3, 2024

Adevinta's Central Product and Tech department has implemented a data mesh architecture to manage and deliver data products across its marketplaces. The initiative emphasizes domain-specific datasets, SQL accessibility, and datasets as products, with a focus on improving decision-making for data analysts, scientists, and product managers. Key strategies include centralized governance, domain-oriented data, and the establishment of working agreements to ensure data quality and alignment across decentralized teams.

Recommending for Long-Term Member Satisfaction at Netflix

project
September 2, 2024

This article discusses how Netflix enhances long-term member satisfaction through personalized recommendations. By moving beyond traditional metrics like clicks or CTR, Netflix uses reward engineering to optimize for long-term satisfaction. The process involves defining proxy rewards based on user interactions, predicting delayed feedback, and aligning recommendations with long-term engagement. Challenges include dealing with delayed feedback, the disparity between online and offline metrics, and refining proxy rewards to better align with long-term satisfaction.

Unlocking Insights with High-Quality Dashboards at Scale

product
August 29, 2024 Tableau GCP BigQuery Looker Studio

The article discusses Spotify's approach to creating and managing high-quality dashboards at scale. Spotify utilizes Tableau and Looker Studio as primary tools, supported by a Dashboard Quality Framework that ensures consistency and trust in the dashboards. The framework includes automatic checks ('Vital Signs') and a manual design checklist ('Spicy Dashboard Design'). The Dashboard Portal centralizes dashboard access, offering search, curation, and quality labeling features, enhancing the overall accessibility and reliability of dashboards across the company.

Automating Data Protection at Scale, Part 1

project
August 28, 2024

This article discusses Airbnb's development of a comprehensive Data Protection Platform (DPP) to address challenges in data security and privacy compliance. The platform integrates various services like Madoka for metadata management, Inspekt for data classification, and Cipher for encryption. It highlights the need for automated data protection due to the complexity of handling sensitive data across different environments and the importance of complying with global regulations like GDPR and CCPA.

Choosing a Data Quality Tool

product
April 6, 2022 Great Expectations Metaplane Lightup Bigeye Datafold Monte Carlo Data Soda

It's a good high level summary, but i think each team still need make some spike to find out the suitable tools for their use case and project.

From Rows to People

product
April 3, 2022 Zingg

The Unbundling of Airflow

vision
March 29, 2022 Apache Airflow dbt fal dbt

This article is for talk about the idea behind fal dbt, extend the dbt capability on airflow platform. It also talk about a lot of other popular tools on Airflow.

Rebundling the Data Platform

product
March 28, 2022 Dagster

I think Dagster has zoom in from Job level view to the asset/table level view for the pipelines. There is always having the Pro and Cons.

Google Cloud helps UK-based fintech Fluidly scale to 50,000 customers and beyond

product
March 24, 2022

It's a good showcase blog for GCP, but it would be very interesting to see some more detail about how Fluidly data team leverage GCP to launch their new data driven business products.

Why We Switched Our Data Orchestration Service

project
March 23, 2022 Flyte

Introducing Natural Language Search for Podcast Episodes

project
March 23, 2022 Vespa

Build event-driven data quality pipelines with AWS Glue DataBrew

tutorial
January 18, 2022 AWS Glue DataBrew

The highlight is configure a data profile jobs without one line of code or SQL, and you have a good UI to check the job output.

The Rise (and Lessons Learned) of ML Models to Personalize Content on Home (Part I)

team
December 31, 2021

Power highly resilient use cases with Amazon Redshift

poc
December 30, 2021

Supercharging Apache Superset

team
September 23, 2021

How Airbnb Built “Wall” to prevent data bugs

team
August 19, 2021

How Uber Achieves Operational Excellence in the Data Quality Experience

team
August 19, 2021

Data Quality at Airbnb

team
August 19, 2021

Unified Flink Source at Pinterest: Streaming Data Processing

tech1
August 16, 2021

Introducing DataFu-Spark

tutorial
August 16, 2021

Scaling data analytics with software engineering best practices

team
August 16, 2021

Distributing the data team to boost innovation reliably

team
August 16, 2021

Easily manage your data lake at scale using AWS Lake Formation Tag-based access control

poc
August 12, 2021

Secure multi-tenant data ingestion pipelines with Amazon Kinesis Data Streams and Kinesis Data Analytics for Apache Flink

poc
August 12, 2021

Query an Apache Hudi dataset in an Amazon S3 data lake with Amazon Athena part 1: Read-optimized queries

poc
August 12, 2021

A Tale of spark JSON data source

tech1
August 9, 2021

Using Amazon Macie to Validate S3 Bucket Data Classification

poc
August 9, 2021

Getting to know our Engineers at Funding Circle: Q&A with Lin Han

team
August 9, 2021

How Spotify Optimized the Largest Dataflow Job Ever for Wrapped 2020

tech1
July 6, 2021

How does Airbnb track and measure growth marketing

team
July 5, 2021

Data as a Product: Applying a Product Mindset to Data at Netflix

vision

The article discusses the importance of treating data as a product at Netflix, emphasizing a product mindset that focuses on intentional design, clear ownership, and continuous evaluation to enhance data utility and trust. It outlines key principles for data products, including clear purpose, defined users, and lifecycle management.

Why We Bet on Rust to Supercharge Feature Store at Agoda

project

The article discusses Agoda's decision to migrate their Feature Store Serving system from a JVM-based stack to Rust, driven by performance and reliability challenges. It details the migration process, including the proof of concept, performance benchmarks, and the importance of shadow testing to ensure correctness. The transition resulted in significant efficiency gains, handling five times more traffic while drastically reducing CPU and memory usage, leading to substantial cost savings. The article emphasizes the role of AI tools and the Rust compiler in facilitating the team's adoption of Rust despite their initial lack of experience.

Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)

tech2

This article provides an overview of four primary methods for evaluating large language models (LLMs): multiple-choice benchmarks, verifiers, leaderboards, and LLM judges. It discusses the advantages and limitations of each method, emphasizing the importance of understanding these evaluation techniques for better interpreting model performance. The article also includes code examples for implementing these evaluation methods from scratch, making it a valuable resource for practitioners in the field of LLM development and evaluation.

Building End-to-End Data Lineage for One-Time and Complex Queries Using Amazon Athena, Amazon Redshift, Amazon Neptune, and dbt

tech2
dbt AWS Redshift AWS Datazone AWS Neptune

This article discusses the challenges and solutions for building end-to-end data lineage in enterprise data analytics, particularly for one-time and complex queries. It highlights the use of Amazon Athena, Amazon Redshift, Amazon Neptune, and dbt to create a unified data modeling language across different platforms. The authors explain how to automate the data lineage generation process using AWS services like Lambda and Step Functions, ensuring accuracy and scalability. The article provides insights into the architecture and implementation details necessary for effective data lineage tracking.

Inside the Race to Build Agent-Native Databases

vision

The article discusses the evolving landscape of databases as they adapt to meet the needs of emerging AI agents. It highlights four innovative initiatives: AgentDB, which treats databases as disposable files; 'Postgres for Agents,' enhancing PostgreSQL for agent use; Databricks Lakebase, merging transactional and analytical capabilities; and Bauplan Labs, focusing on safety and reliability in data operations. These initiatives reflect a broader trend of reimagining database functionality to better serve machine users in an agent-native world.

Move Beyond Chain-of-Thought with Chain-of-Draft on Amazon Bedrock

tech2
AWS Bedrock AWS Database Migration Service

The article discusses the Chain-of-Draft (CoD) prompting technique, which offers a more efficient alternative to the traditional Chain-of-Thought (CoT) method for large language models. CoD reduces verbosity and improves cost efficiency and response times by limiting reasoning steps to five words or less. The authors demonstrate the implementation of CoD using Amazon Bedrock and AWS Lambda, showcasing significant reductions in token usage and latency while maintaining accuracy. The article emphasizes the practical benefits of CoD for organizations scaling their generative AI implementations.

Halodoc’s Layered Data Validation Strategy for Building Trust in the Lakehouse

project
AWS Database Migration Service Great Expectations AWS Redshift AWS Athena

The article outlines Halodoc's comprehensive approach to data validation within a Lakehouse architecture, emphasizing the importance of data accuracy and reliability. It describes a multi-layered validation strategy that employs AI to enhance data quality checks at various stages of the data pipeline. The validation layers include checks for data consistency, structural correctness, business correctness, and reconciliation, ensuring that data remains trustworthy throughout its journey. The implementation of this strategy has led to reduced data incidents and increased trust among analytics and product teams.