If you are still struggling to "have it both ways" in edge computing, this article may be worth your time.
In lightweight convolutional network design, balancing a local receptive field against global spatial awareness under a tight FLOPs budget has long been a core challenge.
Traditional convolution is limited by its receptive field size, while conventional SE-block attention collapses spatial position information through global pooling. To address this, a new operator was developed. The structure fuses large-kernel depthwise convolution with Coordinate Attention and, through a forced residual strategy and GroupNorm, builds a hardware-friendly feature extraction paradigm with robust position-encoding capability.
Before analyzing the code, we need to understand the three core pain points this module attempts to solve:
1. The limited receptive field of small-kernel convolutions, which weakens the capture of textures and shapes.
2. The collapse of spatial position information caused by the global pooling in SE-style attention.
3. The statistical instability of BatchNorm when the effective batch size is small, as in fine-tuning.
The attention-fused convolution kernel constructed here is precisely the solution proposed against this background.
This module is not a simple stack of layers but a carefully designed closed loop of feature reconstruction. Below, we analyze it step by step alongside the code logic:
Code implementation:
Design: The depthwise kernel is upgraded from 3×3 to 5×5 (see kernel_size=5 in the code below). From an information-theory perspective, this enlarges the "visible area" of a single neuron.
Theoretical advantages: The receptive field area of a 5×5 convolution is 25/9 ≈ 2.8 times that of a 3×3. In lightweight networks (such as MobileNetV3), this large-kernel depthwise convolution can effectively emulate the Token Mixer behavior in Transformers, enhancing the ability to capture textures and shapes, and inference frameworks like NCNN already have highly optimized Winograd support for DW operators.
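To make the trade-off concrete, here is a back-of-envelope comparison in plain Python (the channel and feature-map sizes are hypothetical, chosen only for illustration) of a 5×5 depthwise kernel against a 3×3 one and against a dense 5×5 convolution:

```python
# Back-of-envelope cost of depthwise (DW) kernels; C/H/W are hypothetical sizes.
C, H, W = 64, 56, 56

def dw_params(k, c):
    """Parameters of a k x k depthwise conv (groups == channels, no bias)."""
    return k * k * c

def dw_flops(k, c, h, w):
    """Mult-adds of a stride-1, 'same'-padded depthwise conv."""
    return k * k * c * h * w

print(dw_params(5, C) / dw_params(3, C))   # 25/9 ~ 2.78x cost for 2.78x area
print(dw_flops(5, C, H, W))                # 5017600 mult-adds
print(5 * 5 * C * C * H * W)               # a dense 5x5 conv costs C times more
```

The ratio 25/9 applies to parameters, FLOPs, and receptive-field area alike, which is why going from 3×3 to 5×5 is cheap for depthwise layers but expensive for dense ones.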
This is the "soul" of the module. Unlike SE's single global pooling, it applies two orthogonal 1D global pooling operations to decompose spatial information along the height and width axes.
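The difference is easiest to see in the tensor shapes. A minimal PyTorch sketch (hypothetical sizes) contrasting the two directional pools with SE-style global pooling:

```python
# Coordinate Attention's direction-aware pooling vs. SE's global pooling.
import torch
import torch.nn as nn

x = torch.randn(2, 16, 8, 12)              # (N, C, H, W), hypothetical sizes

pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over W -> (N, C, H, 1)
pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over H -> (N, C, 1, W)

x_h, x_w = pool_h(x), pool_w(x)
print(x_h.shape)   # torch.Size([2, 16, 8, 1])  - height positions preserved
print(x_w.shape)   # torch.Size([2, 16, 1, 12]) - width positions preserved

# SE-style pooling for contrast: every position collapses to one scalar/channel.
se = nn.AdaptiveAvgPool2d(1)(x)
print(se.shape)    # torch.Size([2, 16, 1, 1])
```

Each 1D pool keeps one spatial axis intact, so the attention that follows can still say *where* along that axis a response occurred.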
Optimization strategy: A bottleneck with reduction ratio 16 is introduced here, compressing channels in the shared transform so that the attention branch adds negligible parameters and compute.
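The savings are easy to quantify. A quick sketch with a hypothetical channel count of 256:

```python
# Parameter cost of the attention branch with reduction r=16 (C hypothetical).
C, r = 256, 16
mid = C // r                       # squeezed channels

# squeeze (C -> C//r) plus the two directional expands (C//r -> C each):
bottleneck = C * mid + 2 * mid * C
# a naive design with two full-width C -> C 1x1 transforms instead:
full_width = 2 * C * C

print(mid)                         # 16
print(bottleneck / full_width)     # 0.09375 -> roughly 10x fewer parameters
```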
Improvement: Introducing GroupNorm is the finishing touch. In the middle of the attention branch, channels are heavily compressed and the effective batch size is often tiny. GN normalizes over channel groups within each sample, so its statistics do not depend on batch size; this sidesteps the "statistical drift" that BN layers suffer in fine-tuning tasks.
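This batch-size independence can be checked directly: a GroupNorm layer gives the same output for a sample whether it arrives in a batch of 32 or alone, while a training-mode BatchNorm does not (layer sizes below are hypothetical):

```python
# GroupNorm statistics are per-sample, so batch size does not matter.
import torch
import torch.nn as nn

torch.manual_seed(0)
gn = nn.GroupNorm(num_groups=4, num_channels=16)

big = torch.randn(32, 16, 8, 8)
single = big[:1]                          # the same sample, as a batch of one

out_big = gn(big)[:1]
out_single = gn(single)
print(torch.allclose(out_big, out_single, atol=1e-6))   # True

# BatchNorm in training mode mixes statistics across the whole batch instead:
bn = nn.BatchNorm2d(16)
bn.train()
out_bn_big = bn(big)[:1]
out_bn_single = bn(single)
print(torch.allclose(out_bn_big, out_bn_single, atol=1e-6))  # False
```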
To more clearly demonstrate the internal data flow of this module, we formalize the Forward process into the following detailed steps:
1. Input: x of shape (N, C1, H, W).
2. Local feature extraction: x passes through DWConv → PWConv → BN → Hardswish.
3. Output: intermediate feature feat of shape (N, C2, H, W).
4. Coordinate information encoding: feat is average-pooled along the width to get x_h of shape (N, C2, H, 1) and along the height to get x_w of shape (N, C2, 1, W).
5. Transformation and activation: the two directional descriptors are concatenated, compressed by a shared 1×1 convolution with GroupNorm, split back into the two directions, and passed through sigmoid to produce attention maps a_h and a_w, which rescale the feature map (plus the residual input when enabled).
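The steps above can be assembled into a minimal runnable sketch. The class and layer names here are hypothetical (the article's full module definition is not reproduced), and the reduction handling is one plausible choice:

```python
# Minimal sketch of the forward process: 5x5 DW conv + Coordinate Attention.
import torch
import torch.nn as nn

class CoordAttDWBlock(nn.Module):
    def __init__(self, c1, c2, reduction=16):
        super().__init__()
        self.dw = nn.Conv2d(c1, c1, 5, 1, 2, groups=c1, bias=False)
        self.pw = nn.Conv2d(c1, c2, 1, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.Hardswish()
        mid = max(8, c2 // reduction)            # floor to keep GN groups valid
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))
        self.conv_pool = nn.Conv2d(c2, mid, 1, bias=False)
        self.gn = nn.GroupNorm(4, mid)
        self.conv_h = nn.Conv2d(mid, c2, 1)
        self.conv_w = nn.Conv2d(mid, c2, 1)

    def forward(self, x):
        feat = self.act(self.bn(self.pw(self.dw(x))))      # step 1-3: local mixing
        n, c, h, w = feat.shape
        x_h = self.pool_h(feat)                            # (N, C2, H, 1)
        x_w = self.pool_w(feat).permute(0, 1, 3, 2)        # (N, C2, W, 1)
        y = self.gn(self.conv_pool(torch.cat([x_h, x_w], dim=2)))
        y_h, y_w = torch.split(y, [h, w], dim=2)           # step 5: split back
        a_h = self.conv_h(y_h).sigmoid()                   # (N, C2, H, 1)
        a_w = self.conv_w(y_w.permute(0, 1, 3, 2)).sigmoid()  # (N, C2, 1, W)
        out = feat * a_h * a_w                             # directional rescaling
        return x + out if x.shape == out.shape else out    # residual when shapes match

x = torch.randn(1, 32, 20, 20)
print(CoordAttDWBlock(32, 32)(x).shape)  # torch.Size([1, 32, 20, 20])
```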
To verify the effectiveness of this convolution kernel in real scenarios, we conducted rigorous comparative experiments in a controlled environment.
Experimental Setup:
Dataset: a custom detection dataset containing highly similar categories such as batons, flashlights, and knives.
Training strategy: SGD optimizer with a cosine LR schedule, trained for 5000 epochs to ensure full convergence.
Baseline: the proposed module is replaced with a standard 3×3 DWConv, with every other part of the network kept identical.

[Comparison table: Common (baseline) vs. Enhance (proposed module)]
The biggest challenge in testing is "inter-class similarity".
For example, long and thin "batons" and "flashlights" are extremely difficult to distinguish at low resolution.
We extracted the model's Top-1 Accuracy on these specific categories for comparative analysis:

[Top-1 Accuracy table on the confusable categories: Common (baseline) vs. Enhance (proposed module)]
This operator demonstrates a forward-looking design approach for lightweight networks: it is not only a plug-and-play component but also offers a standard spatial-channel decoupling paradigm for future lightweight detection network designs.
```python
# In __init__: 5x5 depthwise conv; padding=2 preserves spatial size at stride 1.
self.dw_conv = nn.Conv2d(c1, c1, kernel_size=5, stride=s, padding=2, groups=c1, bias=False)

# In forward(), after the DW/PW stem has produced `feat`:
n, c, h, w = feat.shape
x_h = self.pool_h(feat)                               # (N, C, H, 1)
x_w = self.pool_w(feat)                               # (N, C, 1, W)
y = torch.cat([x_h, x_w.permute(0, 1, 3, 2)], dim=2)  # align axes first: (N, C, H+W, 1)
y = self.conv_pool(y)                                 # shared 1x1 bottleneck transform
y = self.gn(y)                                        # GroupNorm for small-batch stability
y_h, y_w = torch.split(y, [h, w], dim=2)              # split back into the two directions
a_h = self.conv_h(y_h).sigmoid()                      # (N, C, H, 1) attention along height
a_w = self.conv_w(y_w.permute(0, 1, 3, 2)).sigmoid()  # (N, C, 1, W) attention along width
out = identity_feat * a_w * a_h                       # broadcast directional rescaling
if self.use_res:
    return x + out                                    # forced residual connection
return out
```