ESCAN: Efficient GPU sharing for cascade neural network inference

  • National University of Defense Technology, Deya Road, Changsha, 410000, Hunan, China; Key Laboratory of Advanced Microprocessor Chips and Systems, Deya Road, Changsha, 410000, Hunan, China.

Summary

This summary is machine-generated.

We developed ESCAN, a GPU-sharing framework for cascade neural networks, to improve inference efficiency. ESCAN optimizes device sharing by weighing the gains from sharing against the computation wasted when inputs exit early, which benefits low-latency services.

Area Of Science

  • Computer Science
  • Artificial Intelligence
  • Machine Learning

Background

  • Cascading models balance efficiency and accuracy in industrial deployments.
  • Low-latency services demand optimized execution efficiency and device utilization.
  • Existing GPU-sharing techniques such as NVIDIA Multi-Process Service (MPS) struggle with cascade models because of their early-exit behavior and fixed stage execution order.
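
To make the early-exit behavior concrete, here is a minimal sketch of cascade inference in plain Python. It is an illustration only, not ESCAN's implementation: the stage functions `small_model` and `large_model`, the confidence rule, and the `threshold` value are all hypothetical stand-ins for real networks.

```python
from typing import Callable, List, Sequence, Tuple

# A stage maps an input to (predicted label, confidence in [0, 1]).
Stage = Callable[[float], Tuple[int, float]]


def cascade_infer(x: float, stages: Sequence[Stage], threshold: float) -> Tuple[int, int]:
    """Run stages in order; return (label, index of the stage that answered).

    An input exits early at the first stage whose confidence reaches the
    threshold. The last stage always answers.
    """
    for i, stage in enumerate(stages):
        label, confidence = stage(x)
        if confidence >= threshold or i == len(stages) - 1:
            return label, i
    raise ValueError("stages must be non-empty")


def exit_ratios(xs: Sequence[float], stages: Sequence[Stage], threshold: float) -> List[float]:
    """Fraction of all inputs that are answered at each stage."""
    counts = [0] * len(stages)
    for x in xs:
        _, stage_index = cascade_infer(x, stages, threshold)
        counts[stage_index] += 1
    return [c / len(xs) for c in counts]


# Hypothetical toy stages: the small model is unsure near the decision boundary.
def small_model(x: float) -> Tuple[int, float]:
    return int(x > 0.5), abs(x - 0.5) * 2


def large_model(x: float) -> Tuple[int, float]:
    return int(x > 0.5), 1.0


print(exit_ratios([0.1, 0.45, 0.9, 0.55], [small_model, large_model], threshold=0.5))
# → [0.5, 0.5]
```

In this toy run half of the inputs never reach the second stage, so the later stage sees only part of the traffic. That uneven, data-dependent load is what makes a static GPU partition a poor fit for cascades.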

Purpose Of The Study

  • To address the challenges of applying GPU sharing to cascade neural networks.
  • To propose a framework that optimizes inference efficiency for cascade models.
  • To enhance device utilization and reduce latency in cloud-based inference services.

Main Methods

  • Analyzed cascade neural network characteristics and device-sharing techniques.
  • Developed ESCAN, a GPU-sharing optimization framework for online inference.
  • Integrated exit-ratio-aware batch-parallel execution and resource allocation algorithms in PyTorch.
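
The idea of exit-ratio-aware resource allocation can be sketched as follows. This is a toy model under stated assumptions, not the algorithm from the paper: it assumes two stages, a stage time that scales linearly with load and inversely with its GPU share, and a brute-force search over shares. The names `stage_loads`, `best_split`, `costs`, and `step` are hypothetical.

```python
from typing import List, Sequence, Tuple


def stage_loads(exit_ratios: Sequence[float]) -> List[float]:
    """Fraction of all requests that reach each stage.

    exit_ratios[i] is the fraction of requests arriving at stage i that
    exit there (the last stage should use 1.0).
    """
    loads, remaining = [], 1.0
    for ratio in exit_ratios:
        loads.append(remaining)
        remaining *= 1.0 - ratio
    return loads


def best_split(
    costs: Sequence[float], exit_ratios: Sequence[float], step: int = 10
) -> Tuple[Tuple[int, int], float]:
    """Brute-force the GPU share (in percent) for two concurrently running stages.

    Assumed cost model: time_i = cost_i * load_i / share_i. The split that
    minimizes the slower stage (the bottleneck) wins.
    """
    loads = stage_loads(exit_ratios)
    best = None
    for share0 in range(step, 100, step):
        share1 = 100 - share0
        t0 = costs[0] * loads[0] / (share0 / 100)
        t1 = costs[1] * loads[1] / (share1 / 100)
        bottleneck = max(t0, t1)
        if best is None or bottleneck < best[1]:
            best = ((share0, share1), bottleneck)
    return best


# Stage 2 costs 4x more per request, but 75% of requests exit at stage 1.
print(best_split(costs=[1.0, 4.0], exit_ratios=[0.75, 1.0]))
# → ((50, 50), 2.0)
```

Under this toy model, ignoring the exit ratio (setting it to 0.0) would push the split toward 20/80 in favor of the expensive stage, while accounting for early exits yields an even 50/50 split. A real system would replace the linear cost model with profiled latencies and a smarter search.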

Main Results

  • ESCAN improves inference efficiency by an average of 19.53% compared to parallel execution.
  • Significantly speeds up the search for computation resource allocation schemes.
  • Optimizes computational resource utilization through effective GPU-sharing.

Conclusions

  • ESCAN provides an effective solution for GPU-sharing optimization in cascade neural networks.
  • Achieves a balance between device-sharing gains and early-exit computation wastage.
  • Delivers a low-latency, high-precision optimization for interactive online services using cascade models.
