- Middleboxes and NFV
- Network Abstractions
- eBPF and XDP
- Transport Protocol
- Microservice and Service Mesh
- Network Stack
- Workload Interference
- Internet Architecture
- Container Networking
Middleboxes and NFV
- The Click Modular Router, TOCS '00
- Middleboxes No Longer Considered Harmful, OSDI '04
- Making Middleboxes Someone Else’s Problem: Network Processing as a Cloud Service, SIGCOMM '12
- Design and Implementation of a Consolidated Middlebox Architecture, NSDI '12
  - Today's middleboxes are independent, specialized boxes. CoMb consolidates middleboxes to exploit multiplexing, module reuse, and spatial distribution.
- Split/Merge: System Support for Elastic Execution in Virtual Middleboxes, NSDI '13
  - Autoscaling of stateful network functions.
- ClickOS and the Art of Network Function Virtualization, NSDI '14
- Enforcing Network-Wide Policies in the Presence of Dynamic Middlebox Actions using FlowTags, NSDI '14
- BlindBox: Deep Packet Inspection over Encrypted Traffic, SIGCOMM '15
- Rollback-Recovery for Middleboxes, SIGCOMM '15
- NetBricks: Taking the V out of NFV, OSDI '16
- OpenBox: A Software-Defined Framework for Developing, Deploying, and Managing Network Functions, SIGCOMM '16
- Paving the Way for NFV: Simplifying Middlebox Modifications Using StateAlyzr, NSDI '16
- Stateless Network Functions: Breaking the Tight Coupling of State and Processing, NSDI '17
- NFP: Enabling Network Function Parallelism in NFV, SIGCOMM '17
- NFVnice: Dynamic Backpressure and Scheduling for NFV Service Chains, SIGCOMM '17
- Metron: NFV Service Chains at the True Speed of the Underlying Hardware, NSDI '18
- Elastic Scaling of Stateful Network Functions, NSDI '18
- ResQ: Enabling SLOs in Network Function Virtualization, NSDI '18
- Microboxes: High Performance NFV with Customizable, Asynchronous TCP Stacks and Dynamic Subscriptions, SIGCOMM '18
- ClickNF: a Modular Stack for Custom Network Functions, ATC '18
- FlowBlaze: Stateful Packet Processing in Hardware, NSDI '19
- Performance Contracts for Software Network Functions, NSDI '19
- Correctness and Performance for Stateful Chained Network Functions, NSDI '19
- Verifying software network functions with no verification expertise, SOSP '19
- Gallium: Automated Software Middlebox Offloading to Programmable Switches, SIGCOMM '20
- TEA: Enabling State-Intensive Network Functions on Programmable Switches, SIGCOMM '20
- Contention-Aware Performance Prediction For Virtualized Network Functions, SIGCOMM '20
- SNF: serverless network functions, SoCC '20
- Performance Interfaces for Network Functions, NSDI '22
- Quadrant: A Cloud-Deployable NF Virtualization Platform, SoCC '22
- A High-Speed Stateful Packet Processing Approach for Tbps Programmable Switches, NSDI '23
- ExoPlane: An Operating System for On-Rack Switch Resource Augmentation, NSDI '23
- LemonNFV: Consolidating Heterogeneous Network Functions at Line Speed, NSDI '23
  - Consolidates unmodified NFs built on different platforms (e.g., Snort, Click, and NetBricks).
- Disaggregating Stateful Network Functions, NSDI '23
- Automatic Parallelization of Software Network Functions, NSDI '24
Network Abstractions
- Chimera: A Declarative Language for Streaming Network Traffic Analysis, Security '12
- Abstractions for network update, SIGCOMM '12
- Compiling Path Queries, NSDI '16
- SNAP: Stateful Network-Wide Abstractions for Packet Processing, SIGCOMM '16
- mOS: A Reusable Networking Stack for Flow Monitoring Middleboxes, NSDI '17
- Quantitative Network Monitoring with NetQRE, SIGCOMM '17
- Language-Directed Hardware Design for Network Performance Monitoring, SIGCOMM '17
- Sonata: query-driven streaming network telemetry, SIGCOMM '18
- Lyra: A Cross-Platform Language and Compiler for Data Plane Programming on Heterogeneous ASICs, SIGCOMM '20
- Lucid: a language for control in the data plane, SIGCOMM '21
- Programming Network Stack for Middleboxes with Rubik, NSDI '21
  - A language for programming middleboxes, with an emphasis on supporting various transport protocols and a flexible network stack hierarchy.
- SwiSh: Distributed Shared State Abstractions for Programmable Switches, NSDI '22
- NetRPC: Enabling In-Network Computation in Remote Procedure Calls, NSDI '23
- ClickINC: In-network Computing as a Service in Heterogeneous Programmable Data-center Networks, SIGCOMM '23
eBPF and XDP (See Also awesome-ebpf)
- The eXpress data path: fast programmable packet processing in the operating system kernel, CoNEXT '18
- hXDP: Efficient Software Packet Processing on FPGA NICs, OSDI '20
- Specification and verification in the field: Applying formal methods to BPF just-in-time compilers in the Linux kernel, OSDI '20
- BPF for storage: an exokernel-inspired approach, HotOS '21
- BMC: Accelerating Memcached using Safe In-kernel Caching and Pre-stack Processing, NSDI '21
- Synthesizing Safe and Efficient Kernel Extensions for Packet Processing, SIGCOMM '21
- ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling, SOSP '21
- Syrup: User-Defined Scheduling Across the Stack, SOSP '21
- LiteFlow: towards high-performance adaptive neural networks for kernel datapath, SIGCOMM '22
- XRP: In-Kernel Storage Functions with eBPF, OSDI '22
- Electrode: Accelerating Distributed Protocols with eBPF, NSDI '23
- Tigger: A Database Proxy That Bounces With User-Bypass, VLDB '23
- Automatic Kernel Offload Using BPF, HotOS '23
- EPF: Evil Packet Filter, ATC '23
- DINT: Fast In-Kernel Distributed Transactions with eBPF, NSDI '24
- FetchBPF: Customizable Prefetching Policies in Linux with eBPF, ATC '24
- eTran: Extensible Kernel Transport with eBPF, NSDI '25
Transport Protocol
- Data Center TCP (DCTCP), SIGCOMM '10
- pFabric: minimal near-optimal datacenter transport, SIGCOMM '13
- TIMELY: RTT-based Congestion Control for the Datacenter, SIGCOMM '15
- The QUIC Transport Protocol: Design and Internet-Scale Deployment, SIGCOMM '17
- Credit-Scheduled Delay-Bounded Congestion Control for Datacenters, SIGCOMM '17
- Re-architecting datacenter networks and stacks for low latency and high performance, SIGCOMM '17
- Homa: A Receiver-Driven Low-Latency Transport Protocol Using Network Priorities, SIGCOMM '18
- HPCC: high precision congestion control, SIGCOMM '19
- R2P2: Making RPCs first-class datacenter citizens, ATC '19
- Swift: Delay is Simple and Effective for Congestion Control in the Datacenter, SIGCOMM '20
- Aeolus: A Building Block for Proactive Transport in Datacenters, SIGCOMM '20
- PowerTCP: Pushing the Performance Limits of Datacenter Networks, NSDI '22
- TCP is Harmful to In-Network Computing: Designing a Message Transport Protocol (MTP), HotNets '21
- Towards Domain-Specific Network Transport for Distributed DNN Training, NSDI '24
- MTP: A Transport for In-Network Computing, NSDI '25
Microservice and Service Mesh
- Microservices: yesterday, today, and tomorrow, Springer '17
  - One of the first academic papers on microservices.
- Verification in the Age of Microservices, HotOS '17
- Service Fabric: A Distributed Platform for Building Microservices in the Cloud, EuroSys '18
  - Describes the design of Azure Service Fabric (SF), focusing on how it solves hard consistency and distributed-systems problems.
- Overload Control for Scaling WeChat Microservices, SoCC '18
- µTune: Auto-Tuned Threading for OLDI Microservices, OSDI '18
- An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems, ASPLOS '19
- Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices, ASPLOS '19
- PARTIES: QoS-Aware Resource Partitioning for Multiple Interactive Services, ASPLOS '19
- E3: Energy-Efficient Microservices on SmartNIC-Accelerated Servers, ATC '19
- Autopilot: workload autoscaling at Google, EuroSys '20
- FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-Oriented Microservices, OSDI '20
- Accelerometer: Understanding Acceleration Opportunities for Data Center Overheads at Hyperscale, ASPLOS '20
  - A study of how microservices spend their CPU cycles. It shows that, within Facebook, microservices spend only a small fraction of their execution time on core application logic and a significant share of cycles on orchestration work (e.g., compression, serialization, and I/O processing).
- Nightcore: Efficient and Scalable Serverless Computing for Latency-Sensitive, Interactive Microservices, ASPLOS '21
- Sage: Practical and Scalable ML-Driven Performance Debugging in Microservices, ASPLOS '21
- Sinan: ML-Based and QoS-Aware Resource Management for Cloud Microservices, ASPLOS '21
- Automatic Policy Generation for Inter-Service Access Control of Microservices, Security '21
  - Static analysis of service invocation logic, plus a graph abstraction for policy enforcement.
- Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis, SoCC '21
- SHOWAR: Right-Sizing And Efficient Scheduling of Microservices, SoCC '21
- Service-Level Fault Injection Testing, SoCC '21
- Leveraging Service Meshes as a New Network Layer, HotNets '21
  - Highlights the service mesh as an abstraction and discusses its use cases and challenges.
- DeepRest: Deep Resource Estimation for Interactive Microservices, EuroSys '22
- CRISP: Critical Path Analysis of Large-Scale Microservice Architectures, ATC '22
  - Uber's production-grade microservice tracing system for critical path analysis (CPA), built on top of Jaeger.
  - Section 7.2 has some interesting data on Uber's microservices in production.
- SPRIGHT: Extracting the Server from Serverless Computing! High-performance eBPF-based Event-driven, Shared-memory Processing, SIGCOMM '22
  - Accelerates the service mesh in serverless deployments using eBPF and shared memory.
- DeepScaling: Microservices AutoScaling for Stable CPU Utilization in Large Scale Cloud Systems, SoCC '22
- The Power of Prediction: Microservice Auto Scaling via Workload Learning, SoCC '22
- Executing Microservice Applications on Serverless, Correctly, POPL '23
- The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems, NSDI '23
- Nodens: Enabling Resource Efficient and Fast QoS Recovery of Dynamic Microservice Applications in Datacenters, ATC '23
- Lifting the veil on Meta’s microservice architecture: Analyses of topology and request workflows, ATC '23
- ServiceRouter: Hyperscale and Minimal Cost Service Mesh at Meta, OSDI '23
- Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code, SIGCOMM '23
- Dissecting Overheads of Service Mesh Sidecars, SoCC '23
- LatenSeer: Causal Modeling of End-to-End Latency Distributions by Harnessing Distributed Tracing, SoCC '23
- Expressive Policies For Microservice Networks, HotNets '23
  - Language and system support for complex safety properties that reason about the flow of requests across the whole microservice network (not just between adjacent hops).
- Application Defined Networks, HotNets '23
- Cilantro: Performance-Aware Resource Allocation for General Objectives via Online Feedback, OSDI '23
- Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications, SOSP '23
- MuCache: a General Framework for Caching in Microservice Graphs, NSDI '24
- Autothrottle: A Practical Bi-Level Approach to Resource Management for SLO-Targeted Microservices, NSDI '24
- TraceWeaver: Distributed Request Tracing for Microservices Without Application Modification, SIGCOMM '24
- TopFull: An Adaptive Top-Down Overload Control for SLO-Oriented Microservices, SIGCOMM '24
- Canal Mesh: A Cloud-Scale Sidecar-Free Multi-Tenant Service Mesh Architecture, SIGCOMM '24
- Derm: SLA-aware Resource Management for Highly Dynamic Microservices, ISCA '24
Network Stack
- netmap: A Novel Framework for Fast Packet I/O, ATC '12
- Chronos: Predictable Low Latency for Data Center Applications, SoCC '12
- Improving Network Connection Locality on Multicore Systems, EuroSys '12
- MegaPipe: A New Programming Interface for Scalable Network I/O, OSDI '12
- mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems, NSDI '14
- Network stack specialization for performance, SIGCOMM '14
- IX: A Protected Dataplane Operating System for High Throughput and Low Latency, OSDI '14
- Arrakis: The Operating System is the Control Plane, OSDI '14
- StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs, ATC '16
- ModNet: A Modular Approach to Network Stack Extension, NSDI '15
- RSS++: load and state-aware receive side scaling, CoNEXT '19
- TAS: TCP Acceleration as an OS Service, EuroSys '19
  - Accelerates the TCP stack by splitting it into a "fast" data path (for data transport on established connections) and a control plane (for connection and context management, congestion control, etc.).
- Snap: a Microkernel Approach to Host Networking, SOSP '19
- SocksDirect: Datacenter Sockets can be Fast and Compatible, SIGCOMM '19
- Understanding Host Network Stack Overheads, SIGCOMM '21
- The nanoPU: A Nanosecond Network Stack for Datacenters, OSDI '21
- How to diagnose nanosecond network latencies in rich end-host stacks, NSDI '22
- Remote Procedure Call as a Managed System Service, NSDI '23
- NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs, SIGCOMM '23
- Fathom: Understanding Datacenter Application Network Performance, SIGCOMM '23
- A Cloud-Scale Characterization of Remote Procedure Calls, SOSP '23
- HydraRPC: RPC in the CXL Era, ATC '24
Workload Interference
- Q-Clouds: Managing Performance Interference Effects for QoS-Aware Clouds, EuroSys '10
  - Profiles each application's performance when running standalone and uses that profile as the baseline target when consolidating applications onto a shared host.
- Cuanta: quantifying effects of shared on-chip resource interference for consolidated virtual machines, SoCC '11
  - Introduces a cache-loader micro-benchmark that profiles application performance under varying cache-usage pressure, and uses the profile to predict the impact of cache interference among consolidated workloads.
- Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations, MICRO '11
  - Profiles each application (1) against a memory antagonist to obtain its memory sensitivity curve and (2) to measure the memory pressure it generates.
- Toward Predictable Performance in Software Packet-Processing Platforms, NSDI '12
  - Profiles each NF's cache references per second when running alone and its performance-drop curve when co-located with a synthetic antagonist, then predicts the performance drop from these profiles.
- DeepDive: Transparently Identifying and Managing Performance Interference in Virtualized Environments, ATC '13
  - Detects interference via differential low-level metrics (see Table 1 of the paper), validates the interference and identifies the interfering resource by running the victim in isolation, and mitigates interference via migration.
- Bobtail: Avoiding Long Tails in the Cloud, NSDI '13
- Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters, ASPLOS '13
- CPI²: CPU performance isolation for shared compute clusters, EuroSys '13
  - Uses cycles per instruction (CPI) as a metric to detect workload interference, identify the perpetrators, and address the interference by throttling them. Key takeaway: CPI correlates with application performance and is a stable metric.
- Reconciling High Server Utilization and Sub-millisecond Quality-of-Service, EuroSys '14
  - Shows that co-location increases queuing delay, scheduling delay, and thread load imbalance, and addresses the resulting interference online via re-provisioning and scheduling.
- Heracles: Improving resource efficiency at scale, ISCA '15
  - Manages co-location of latency-critical (LC) and best-effort (BE) workloads via an online controller that monitors latency and resource usage and manages the isolation mechanism for each resource.
- PerfIso: Performance Isolation for Commercial Latency-Sensitive Services, ATC '18
  - Describes a production system for performance isolation, deployed for Microsoft Bing.
- PARTIES: QoS-Aware Resource Partitioning for Multiple Interactive Services, ASPLOS '19
  - Uses online monitoring to detect QoS violations within O(100 ms) and boosts the resource allocation of the victims.
- PicNIC: predictable virtualized NIC, SIGCOMM '19
  - Characterizes how performance isolation can break down in a virtualized network stack, in terms of both network bandwidth and network-stack processing rate. Provides an abstraction and constructs based on bandwidth, latency, and loss rate to detect isolation breakdowns and enforce isolation.
- Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads, NSDI '19
- Caladan: Mitigating Interference at Microsecond Timescales, OSDI '20
  - Uses a set of control signals and corresponding actions to detect and respond to interference at microsecond timescales.
- FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices, OSDI '20
  - Uses online telemetry data (resource usage and latency) and offline-learned models to detect and localize the microservices that cause SLO violations, and mitigates violations via dynamic re-provisioning.
Internet Architecture
- Architectural considerations for a new generation of protocols, SIGCOMM CCR '90
- A Data-Oriented (and Beyond) Network Architecture, SIGCOMM '07
- Networking named content, CoNEXT '09
- XIA: Efficient Support for Evolvable Internetworking, NSDI '12
- Serval: An End-Host Stack for Service-Centric Networking, NSDI '12
- Enabling a Permanent Revolution in Internet Architecture, SIGCOMM '19
Container Networking
- Slipstream: Automatic Interprocess Communication Optimization, ATC '15
- Slacker: Fast Distribution with Lazy Docker Containers, FAST '16
- Improving Docker Registry Design Based on Production Workload Analysis, FAST '18
- Cntr: Lightweight OS Containers, ATC '18
- Iron: Isolating Network-based CPU in Container Environments, NSDI '18
- FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds, NSDI '19
- Slim: OS Kernel Support for a Low-Overhead Container Overlay Network, NSDI '19
- Houdini's Escape: Breaking the Resource Rein of Linux Control Groups, CCS '19
- Particle: Ephemeral Endpoints for Serverless Networking, SoCC '20
- Parallelizing packet processing in container overlay networks, EuroSys '21
- MigrOS: Transparent Live-Migration Support for Containerised RDMA Applications, ATC '21
- Starlight: Fast Container Provisioning on the Edge and over the WAN, NSDI '22
- Transparent GPU Sharing in Container Clouds for Deep Learning Workloads, NSDI '23