From ReLU to GeMU: Activation functions in the lens of cone projection

  • 1. Department of Automation and BNRist, Tsinghua University, Beijing, 100084, China. Electronic address: lijiayun22@mails.tsinghua.edu.cn.
  • 2. Department of Automation and BNRist, Tsinghua University, Beijing, 100084, China. Electronic address: cyx22@mails.tsinghua.edu.cn.
  • 3. Department of Automation and BNRist, Tsinghua University, Beijing, 100084, China. Electronic address: luyw20@mails.tsinghua.edu.cn.
  • 4. Department of Automation and BNRist, Tsinghua University, Beijing, 100084, China. Electronic address: xzf23@mails.tsinghua.edu.cn.
  • 5. Department of Automation and BNRist, Tsinghua University, Beijing, 100084, China. Electronic address: ylmo@tsinghua.edu.cn.
  • 6. Department of Automation and BNRist, Tsinghua University, Beijing, 100084, China. Electronic address: gaohuang@tsinghua.edu.cn.

Abstract

Activation functions are essential for introducing nonlinearity into neural networks, with the Rectified Linear Unit (ReLU) often favored for its simplicity and effectiveness. Motivated by the structural similarity between a single layer of a Feedforward Neural Network (FNN) and a single iteration of the Projected Gradient Descent (PGD) algorithm for constrained optimization problems, we interpret ReLU as the projection from R onto the nonnegative half-line R+. Building on this interpretation, we generalize ReLU to a Generalized Multivariate projection Unit (GeMU), a projection operator onto a convex cone such as the Second-Order Cone (SOC). We prove that the expressive power of FNNs activated by our proposed GeMU is strictly greater than that of FNNs activated by ReLU. Experimental evaluations further corroborate that GeMU is versatile across prevalent architectures and distinct tasks, and that it can outperform various existing activation functions.
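
Since the abstract describes GeMU as a projection onto a convex cone such as the Second-Order Cone, a minimal sketch may help make the geometric picture concrete. The code below is not the authors' implementation: the closed-form SOC projection is standard, but the way pre-activations are grouped into (x, t) blocks and the block size are hypothetical choices made purely for illustration.

# Minimal sketch (not the paper's implementation): activation functions viewed as
# Euclidean projections onto convex cones, following the abstract's description.
# The exact GeMU parameterization (how pre-activations are grouped, any learnable
# parameters) is not given in the abstract; the grouping below is an assumption.
import numpy as np


def relu(z: np.ndarray) -> np.ndarray:
    """ReLU(z) = max(z, 0): elementwise projection of each coordinate onto R+."""
    return np.maximum(z, 0.0)


def soc_projection(x: np.ndarray, t: float) -> tuple:
    """Euclidean projection of (x, t) onto the second-order cone
    K = {(x, t) : ||x||_2 <= t}, using the standard closed-form solution."""
    norm_x = np.linalg.norm(x)
    if norm_x <= t:          # already inside the cone
        return x, t
    if norm_x <= -t:         # inside the polar cone: project to the origin
        return np.zeros_like(x), 0.0
    # otherwise project onto the boundary of the cone
    scale = (norm_x + t) / (2.0 * norm_x)
    return scale * x, scale * norm_x


def gemu(z: np.ndarray, block: int = 4) -> np.ndarray:
    """Hypothetical GeMU-style activation: split the pre-activation vector into
    blocks of size `block`, treat the last entry of each block as the cone
    'height' t and the rest as x, and project each block onto the SOC."""
    assert z.size % block == 0, "feature dimension must be divisible by block size"
    out = np.empty_like(z)
    for i in range(0, z.size, block):
        x, t = soc_projection(z[i:i + block - 1], z[i + block - 1])
        out[i:i + block - 1], out[i + block - 1] = x, t
    return out


if __name__ == "__main__":
    z = np.random.randn(8)
    print("ReLU :", relu(z))
    print("GeMU :", gemu(z, block=4))

In this sketch, setting the block size to 1 makes the cone collapse to the nonnegative half-line R+, and the construction reduces to ordinary ReLU, matching the abstract's view of ReLU as the one-dimensional special case of a cone projection.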
