A Review of Neural Network Lightweighting Techniques

The application of deep-learning-based portable devices has become increasingly widespread, making the deployment of complex neural networks on embedded devices a hot research topic. Neural network lightweighting is one of the key technologies for applying neural networks to embedded devices. This paper elaborates on and analyzes neural network lightweighting techniques from two aspects: model pruning and network structure design. For model pruning, methods from different periods are compared, highlighting their advantages and limitations. Regarding network structure design, the principles of four classical lightweight network designs are described from a mathematical perspective, and the latest optimization methods for these networks are reviewed. Finally, potential research directions for lightweight neural network pruning and structure design optimization are discussed.


Introduction
Deep learning differs significantly from traditional manual feature design. Convolutional neural networks (CNNs) employed in deep learning can automatically extract deep features of targets without manual feature extraction. This characteristic greatly lowers the barrier to applying deep learning in image recognition [1][2][3]. Consequently, deep CNNs have become increasingly mature and successful in fields such as military, transportation, and healthcare. However, to achieve higher accuracy, the depth of neural networks continues to increase, resulting in higher computational complexity and storage requirements. As performance demands escalate, efficiency becomes a primary concern in network design; specifically, efficiency issues primarily involve model storage and prediction speed. Therefore, lightweighting techniques are needed to address efficiency concerns while maintaining accuracy [4][5][6][7].
The goal of model lightweighting is to overcome the inability of traditional neural networks to run on small-scale hardware, in terms of both storage space and energy consumption. To achieve this, optimization techniques such as network structure design and model compression are employed to reduce storage requirements and improve execution speed while maintaining the accuracy of traditional neural networks [8][9][10]. In recent years, the research directions of lightweight neural networks have been continuously expanding, requiring ongoing exploration, comparison, and updating. Notably, excellent lightweight network models often possess multiple capabilities, and optimization trends have become diverse, no longer limited to a single model compression algorithm or the replacement of lightweight modules. A comprehensive summary of optimization methods for lightweight neural network architectures is therefore necessary [11][12][13].
This paper provides a comprehensive review of classical compression algorithms and network structures for neural networks. Firstly, it elaborates on model pruning algorithms and analyzes and summarizes recent research advances based on them. Model pruning encompasses structured and unstructured pruning, with structured pruning covering techniques such as convolutional kernel pruning and channel pruning [14]. Next, it analyzes several classical lightweight neural networks from the perspectives of lightweight module design and convolutional structure optimization, and summarizes the latest research achievements based on these network structures. Finally, it discusses the prospects and challenges of lightweight neural networks and provides a comprehensive conclusion [15][16][17].
The paper is organized as follows: Section II introduces the two classes of model pruning algorithms and analyzes their advantages and disadvantages. Section III presents the ideas and methods of network structure design, summarizing four lightweight network structures that have been applied and improved in recent years, and analyzes their characteristics and performance. Section IV discusses the future development trends and challenges of lightweight neural networks. Section V provides a comprehensive summary of the work conducted in this paper.

Model pruning
Model pruning is one of the most commonly used methods for compressing neural network models. Its primary objective is to reduce computational complexity and model size by removing unimportant neurons from the neural network. Model pruning algorithms can be categorized into two types: unstructured pruning and structured pruning. The classification of model pruning methods [18] is illustrated in Figure 1. The distinction lies in whether an entire node or convolutional kernel is removed at once. Unstructured pruning algorithms consider each element of every convolutional kernel and remove the parameters whose values are zero; because every parameter of the network model is taken into account, this allows more fine-grained pruning. In contrast, structured pruning algorithms adopt a coarse-grained approach, directly removing the structured information of entire convolutional kernels, which effectively reduces the size of the model and improves its performance. Specifically, kernel pruning refers to the removal of a group of convolutional kernels in a convolutional layer, while channel pruning refers to the removal of entire channels in a convolutional layer. This subdivision provides a clearer description of the different ways in which structured pruning can be performed.

Unstructured pruning
Unstructured pruning does not adhere to specific geometric shapes or constraints when removing zero-valued parameters from the convolutional kernels. Figure 2 illustrates the process of unstructured pruning, demonstrating its fine-grained approach. This paper surveys the representative works and recent advances in unstructured pruning algorithms. The earliest pruning algorithm can be traced back to the Optimal Brain Damage (OBD) algorithm proposed in reference [19], which belongs to the category of single-weight pruning algorithms within unstructured pruning. The OBD algorithm uses the Hessian matrix of the loss function to estimate parameter importance and prunes the parameters with the lowest importance. However, OBD simplifies the computation of the Hessian by ignoring its off-diagonal terms, an idealized assumption. Subsequently, reference [20] studied the off-diagonal terms of the Hessian matrix and found that the assumption made by OBD is invalid in many cases; to overcome this limitation, the Optimal Brain Surgeon (OBS) method [20] was proposed, which uses all second-order derivative information of the error function for network pruning without the need for retraining. OBD and OBS share a common drawback: the high computational cost of computing and updating the significance of all parameters in each iteration. To address this issue, reference [21] proposed using minimum contribution variance as the pruning criterion: if a parameter's output remains almost the same before and after bias, its contribution is considered insignificant, and the parameters with the smallest contribution variances on the training set can be removed. In addition, reference [22] proposed directly constructing a weight saliency matrix and sorting it to select insignificant, redundant nodes for pruning. Furthermore, reference [23] introduced a method that learns the importance of weight connections for pruning. This method consists of three steps, as shown in Figure 3. Through an iterative process of connection pruning and weight retraining, it reduces storage and computational requirements by an order of magnitude while maintaining accuracy.

Figure 3:
The three main steps of the pruning process.
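The three-step procedure of reference [23] (train, prune low-magnitude connections, retrain) can be sketched as follows. This is a minimal numpy illustration of the magnitude-based pruning step only; the layer shape and sparsity level are arbitrary choices for demonstration, not values from the original work.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights,
    returning the pruned weights and the binary keep-mask
    (unstructured, single-weight pruning)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones_like(weights)
    # Threshold = k-th smallest magnitude; weights below it are removed.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 3, 3, 3))           # a conv layer's weights
pruned, mask = magnitude_prune(w, sparsity=0.9)
print(round(1.0 - float(mask.mean()), 2))    # fraction of zeroed weights, about 0.9
```

In the full procedure, the returned mask is held fixed while the surviving weights are retrained, and the prune/retrain cycle is iterated until the target sparsity is reached.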

Structured pruning
In contrast to unstructured pruning algorithms, structured pruning algorithms target entire structures (such as convolutional kernels or channels) rather than individual parameters. By removing entire structures at once, without the need to evaluate parameters individually, structured pruning algorithms have significantly lower complexity than unstructured ones, and structured pruning has therefore become an important direction in pruning research. This paper categorizes structured pruning into two types, convolutional kernel pruning and channel pruning, and discusses each in turn [34][35][36].

Convolution kernel pruning
Convolutional kernel pruning is a coarse-grained pruning method: when a convolutional kernel is pruned from a layer, the input channel connected to it in the subsequent layer is pruned along with it. This method effectively reduces the number of model parameters by removing convolutional kernels of low importance. Figure 4 illustrates the process of convolutional kernel pruning. In reference [37], an algorithm based on global search and convolutional kernel saliency is proposed. The algorithm uses the Taylor expansion criterion to expand the objective function, identifies the convolutional kernel causing the least change in the objective function as the least important one, and replaces it with zero values. Reference [38] improves on this algorithm by extending it consistently to any layer in the network, eliminating the need for per-layer sensitivity analysis. Reference [39] introduces the ThiNet pruning algorithm, which establishes a one-to-one relationship between the current layer's convolutional kernels and the next layer's input channels through the convolutional computation, and exploits this relationship to assess the saliency of the next layer's input channels. In reference [40], a novel kernel pruning method called Filter Pruning via Geometric Median (FPGM) is proposed. The geometric median, as described in reference [41], is defined as follows: given a set of points x_1, x_2, ..., x_n in R^d, the geometric median is the point whose sum of Euclidean distances to all of them is minimized, as shown in Equation (1):

x* = argmin_{x in R^d} Σ_{i=1}^{n} ||x − x_i||_2    (1)

FPGM prunes the filters closest to the geometric median of all filters in a layer, since such filters are the most replaceable; unlike norm-based criteria, it does not require a large norm deviation among filters or a near-zero minimum norm. Its usefulness and advantages are verified on two image classification benchmarks. Reference [42] introduces an approximate Oracle convolutional kernel pruning algorithm, which prunes kernels by randomly masking them and measuring the cumulative change in the next layer's output to find the least significant kernels. Furthermore, reference [43] proposes an end-to-end joint pruning method that can simultaneously prune convolutional kernels and other structures; by employing generative adversarial learning, it effectively addresses the optimization problem. Reference [44] presents a dynamic pruning algorithm that predicts the saliency of the next layer's convolutional channels during training and skips channels with lower saliency; this dynamic behavior allows different input images to flexibly skip different channels according to their characteristics. Moreover, reference [45] introduces a meta-convolutional-kernel pruning algorithm, which considers the relationships between convolutional kernels and constructs a meta-pruning framework that adaptively selects appropriate pruning methods when the kernel distribution changes. Reference [46] proposes a meta-learning pruning algorithm, which first trains a pruning network via random structure sampling and uses meta-learning to predict the accuracy of the pruned network; it can search for pruned networks under different constraints without manual intervention and without fine-tuning during the search.
Reference [47] proposes a method called overall global pruning, which uses the idea of pruning convolutional kernels to address the limitations of amplitude-based methods when pruning fully connected layers. Furthermore, reference [48] introduces a method named Collaborative Channel Pruning (CCPrune), which effectively assesses the significance of channels by combining the weights of convolutional layers with the scaling factors of Batch Normalization (BN) layers. Moreover, reference [49] introduces Global Filter Importance-based Adaptive Pruning (GFI-AP). This approach assigns importance scores to each convolutional kernel by evaluating how effectively the network learns the input-to-output mapping on the dataset, enabling a comprehensive comparison among kernels. Reference [50] proposes dynamically removing redundant convolutional kernels by embedding the manifold information of all instances into the pruned network space; by aligning the manifold information between the recognition complexity and feature similarity of training images with the pruned subnetwork, it maximizes the utilization of redundancy within the given network structure. Reference [51] introduces a kernel pruning method that explores feature map rank (HRank) and develops a mathematical formulation for kernel pruning.
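As an illustration of the FPGM criterion discussed above, the following numpy sketch ranks filters by their summed Euclidean distance to all other filters (the quantity minimized by the geometric median) and marks the nearest ones as redundant. The layer shape and pruning count are illustrative assumptions, and the distance-sum ranking is a common practical approximation of the geometric median, not the exact optimization.

```python
import numpy as np

def fpgm_select(filters, n_prune):
    """Rank filters by their summed Euclidean distance to all other
    filters; the smallest sums lie nearest the geometric median and
    are treated as most replaceable (FPGM-style criterion)."""
    flat = filters.reshape(filters.shape[0], -1)         # (num_filters, k*k*c)
    # Pairwise distance matrix between the flattened filters.
    dists = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    redundancy = dists.sum(axis=1)                       # small sum -> near median
    return np.argsort(redundancy)[:n_prune]              # indices to prune

rng = np.random.default_rng(1)
conv_filters = rng.normal(size=(16, 3, 3, 8))            # 16 filters of size 3x3x8
prune_idx = fpgm_select(conv_filters, n_prune=4)
keep_idx = np.setdiff1d(np.arange(16), prune_idx)
print(sorted(prune_idx.tolist()), keep_idx.size)         # 4 pruned, 12 kept
```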

Channel pruning
Channel pruning is a method for pruning redundant channels in feature maps, without considering the impact of convolutional kernel weights. It is particularly effective when the feature maps contain a significant amount of redundancy. The process of channel pruning is illustrated in Figure 5. By pruning redundant channels, model compression can be achieved. One advantage of channel pruning is that it does not rely on sparse convolution libraries or specialized hardware, yet can still achieve high compression rates. Reference [52] proposes a method based on eliminating low-activity channels, which reduces the computation between each convolutional kernel and the channels that contribute little to the model's predictions; this effectively reduces the computational load without significantly impacting performance. Similarly, at the channel level, reference [53] introduces a channel pruning method based on LASSO regularization and linear least squares. This method first identifies and removes redundant convolutional kernels and their corresponding feature maps, reducing the model's parameter count and computational complexity, and then reconstructs the remaining network to restore its predictive ability.
Building upon the work in reference [53], reference [54] argues for the necessity of jointly pruning neurons across the entire neural network under a unified objective. By considering the relationships between different neurons and their contributions to overall network performance during pruning, a unified objective ensures that the pruned network maintains good predictive performance during the retraining phase. Reference [55] proposes the network slimming method, a pruning algorithm commonly applied to large-scale networks. Its core idea is to introduce a scaling factor γ for each channel and establish the objective function shown in Equation (2):

L = Σ_{(x,y)} l(f(x, W), y) + λ Σ_{γ∈Γ} g(γ)    (2)

where x and y are the input and the target output respectively, W denotes the trainable weights, l(·) is the prediction loss, g(·) is the sparsity penalty on the scaling factors, γ is the channel scaling factor, and λ is the balance factor. Jointly optimizing the regularization term on the scaling factors γ and the weight loss function automatically identifies and removes unimportant channels, improving the computational speed of the network. The network slimming process is shown in Figure 6. Reference [56] presents a more general and effective improvement of the method in reference [55]: instead of directly reusing the parameters of the Batch Normalization (BN) layer, it introduces additional scale factors to broaden the method's applicability. Reference [57] proposes a discrimination-aware channel pruning method for pre-trained models. This method introduces an additional channel-aware loss function, combines it with the classification loss and a reconstruction error, and uses the ℓ2,0-norm to iteratively induce sparsity during channel pruning and parameter optimization. Reference [58] challenges the effectiveness of norm-based criteria and proposes a norm-independent channel pruning technique. This method employs end-to-end stochastic training that forces certain channels to output constants, and then adjusts the biases of the affected layers to eliminate these constant channels, achieving channel pruning. Furthermore, reference [59] proposes an optimal thresholding (OT) method, which prunes channels with layer-dependent thresholds that optimally separate important channels from negligible ones. Using OT, most unimportant channels can be pruned to achieve high sparsity while minimizing performance degradation. In reference [60], researchers determine the channel configuration of pruned models through random search; experimental results demonstrate that this simple strategy is competitive with other channel pruning methods. Existing methods often treat the pruning rate as a hyperparameter and overlook the differing sensitivities of the convolutional layers. Reference [61] introduces a sensitivity-based channel pruning method that uses second-order sensitivity: insensitive filters are pruned while sensitive ones are preserved, with a filter's sensitivity quantified as the sum of the sensitivities of its individual weights. The method also incorporates layer sensitivity via Hessian eigenvalues, automating the choice of the optimal pruning rate for each layer.
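The channel selection step of network slimming can be sketched as follows. The per-layer gamma values below are hypothetical (standing in for BN scaling factors after sparsity training), and the global-threshold rule is the selection strategy described for the method; this is a sketch of the selection logic only, not of the training or fine-tuning stages.

```python
import numpy as np

def slim_channels(bn_gammas, prune_ratio):
    """Network-slimming-style selection: channels whose BN scaling factor
    gamma has small magnitude are marked for removal, using a single
    global threshold taken across all layers."""
    all_g = np.abs(np.concatenate(bn_gammas))
    threshold = np.quantile(all_g, prune_ratio)
    return [np.abs(g) > threshold for g in bn_gammas]    # per-layer keep masks

# Hypothetical gamma values for three BN layers after sparsity training.
gammas = [np.array([0.9, 0.01, 0.4, 0.02]),
          np.array([0.5, 0.03, 0.8]),
          np.array([0.02, 0.7])]
keep = slim_channels(gammas, prune_ratio=0.4)
print([m.tolist() for m in keep])    # near-zero gammas are dropped
```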

Analysis and discussion
By analyzing pruning algorithms from different periods, including the latest ones, we can observe significant advantages of unstructured pruning. The most notable is its ability to directly zero out or trim a large number of parameters, yielding a highly sparse model without significantly affecting accuracy. Additionally, unstructured pruning can tailor parameters to the underlying logic of different hardware, leading to better acceleration. However, unstructured pruning also has noticeable drawbacks. Firstly, because it considers the impact of individual neurons on the network, it can be computationally intensive. Secondly, simply applying unstructured pruning does not directly accelerate computation, since the size of the pruned matrix remains unchanged; sparse matrix multiplication and related operations are still required, which may not yield substantial acceleration on certain hardware. Moreover, unstructured pruning algorithms may rely on specific software or hardware implementations, limiting their flexibility and portability across deep-learning frameworks. In contrast, structured pruning algorithms have advantages in these respects: structured pruning reduces computational complexity, simplifies sparse matrix computation, and is easier to use across different deep learning frameworks. Consequently, recent research has leaned towards structured pruning algorithms for model pruning [62][63][64][65][66][67].
Structured pruning algorithms have advantages in hardware acceleration and prediction accuracy because they consider a more comprehensive set of factors. Compared to unstructured pruning, structured pruning achieves model compression and acceleration by pruning entire convolutional kernels or channels. However, structured pruning algorithms also have limitations. Firstly, convolutional kernel pruning algorithms often overlook the relationships between kernels; kernels sometimes work together in a coordinated manner to achieve accurate predictions, so pruning based solely on the individual significance of each kernel may not yield optimal results. Secondly, for new models, one-shot pruning with structured algorithms often struggles to maintain the accuracy of the original model, so algorithm-level optimizations are needed for better accuracy preservation. Additionally, conventional structured pruning algorithms require manual configuration of pruning thresholds and other hyperparameters, which limits their automation; fully automated learning modes cannot yet be realized [68][69][70][71].
To sum up, structured pruning algorithms have significant advantages over unstructured pruning in hardware acceleration and prediction accuracy. However, challenges remain: the relationships between convolutional kernels must be considered and the algorithms optimized at the algorithmic level to better preserve accuracy, and the level of automation needs further improvement to make the pruning process more convenient and efficient. These are important directions in current research for further enhancing the effectiveness of structured pruning algorithms.

Network Architecture Design
The design of lightweight network architectures aims to reduce model complexity and computational resource consumption by optimizing the network architecture [72][73][74]. The goal is to create more efficient network structures that achieve model size compression, faster runtime, and reduced training difficulty. Under network architecture design, this paper discusses how to achieve model lightweighting from two aspects: lightweight module design and convolutional structure optimization.

Lightweight module design
The design of lightweight modules aims to reduce model complexity by creating compact and efficient network modules. These modules often employ specific structures and operations to minimize the number of parameters and the computational requirements. Additionally, lightweight module design adopts a modular approach, breaking the network down into smaller modules and constructing the entire network by combining them. This modular design enhances the flexibility and scalability of the network.

Fire module
The Fire module consists of two sub-layers: a squeeze layer and an expand layer. The squeeze layer uses 1×1 convolutional kernels, while the expand layer employs both 1×1 and 3×3 kernels. Figure 7 illustrates the structure of the Fire module. To reduce the number of network parameters, the Fire module relies on this squeeze-and-expand design: in the expand layer, some 3×3 kernels are replaced with 1×1 kernels to decrease the number of 3×3 kernels, while the squeeze layer uses 1×1 kernels to limit its output channel count and thus the number of inputs to the expand layer. This design strategy was applied in the classic SqueezeNet [75], whose core building block is the Fire module. Compared to AlexNet [76], the SqueezeNet network constructed by stacking Fire modules reduces the number of parameters by 50 times while maintaining comparable accuracy. Following SqueezeNet, a successor architecture called SqueezeNext was also introduced.

Ghost module
Observations of conventional CNNs show that their intermediate feature maps contain many similar, redundant copies. These redundant feature maps increase the computational burden of the network, yet removing them outright would significantly degrade the model's recognition performance.
To address this issue, the Ghost module provides a method for generating a large number of similar feature maps at a much smaller computational cost. Its core idea is to produce a small set of intrinsic feature maps with ordinary convolutions and then derive additional "ghost" feature maps from them, thereby reducing the number of parameters and computations. By leveraging inexpensive ("cheap") linear transformation operations, the Ghost module extracts rich feature representations while maintaining a small computational cost, which makes it highly applicable in lightweight network design. GhostNet [82] is a lightweight neural network based on the MobileNetV3 [83] architecture: it replaces the ordinary convolutions in MobileNetV3 with Ghost modules, forming Ghost bottlenecks, and builds GhostNet upon this foundation.
Experimental results have demonstrated that, compared to other lightweight architectures such as the MobileNet and ShuffleNet series, GhostNet achieves higher accuracy on ImageNet classification with comparable parameter and computation counts. In reference [84], researchers applied GhostNet to the backbone of YoloV4, producing an improved network called Ghostnet-YoloV4. This network extracts features efficiently and significantly reduces the computation required for real-time counting. In field tests on nursery plots, the method not only overcomes noise interference in large field environments but also meets the computational constraints of low-specification embedded mobile devices in management systems, with counting and measurement accuracy both exceeding 92%. To further enhance lightweight image recognition models, reference [85] introduces L-GhostNet, which integrates group convolution learning and an improved Channel Attention (CA) method into GhostNet. Experimental results show that, compared to GhostNet, L-GhostNet achieves slightly higher accuracy across various datasets while reducing computational cost by over 44% and parameter count by over 33%, and it delivers a 26% increase in frames per second (FPS). Compared with commonly used lightweight models such as the MobileNets and ShuffleNets at the same FLOP (floating point operations) level, L-GhostNet achieves the lowest FLOPs, the highest accuracy, and fewer parameters, demonstrating excellent overall performance. Furthermore, reference [86] proposes a CBAM-GhostNet-SSD network, which introduces Ghost modules and an Efficient Channel Attention (ECA) mechanism into the SSD object detection algorithm. By dynamically allocating parameters and reweighting detection regions, this method improves performance; to enhance the recognition of small objects, the CBAM module is also introduced. Compared to the original SSD network, CBAM-GhostNet-SSD reduces parameter and computation counts by 98.23% and achieves a 14.5% increase in mAP.
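The computational saving of the Ghost module can be estimated from its cost model: a standard convolution producing n output maps is replaced by a convolution producing n/s intrinsic maps plus cheap d×d operations generating the remaining ones. The layer sizes in the sketch below are illustrative assumptions, not values from any cited experiment.

```python
def ghost_flops_ratio(c, n, k, d, s, h, w):
    """Ratio of standard-convolution cost to Ghost-module cost, where the
    Ghost module keeps n/s intrinsic maps from an ordinary convolution and
    derives the rest with d x d cheap (depthwise) operations."""
    standard = n * h * w * c * k * k
    m = n // s                                   # intrinsic feature maps
    ghost = m * h * w * c * k * k + (s - 1) * m * h * w * d * d
    return standard / ghost

# e.g. 128 -> 256 channels, 3x3 kernels, 3x3 cheap ops, s = 2
print(round(ghost_flops_ratio(c=128, n=256, k=3, d=3, s=2, h=28, w=28), 2))
```

The ratio approaches s when c·k² is much larger than d², which is why the speed-up is roughly the chosen expansion factor s.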

Convolutional structure optimization
Convolutional structure optimization aims to reduce computational resource consumption by optimizing the convolutional layers. Common optimization methods include lightweight convolutional operations such as depthwise separable convolution and grouped convolution. These operations maintain lower computational complexity while preserving a certain level of prediction accuracy.

Group convolution
Grouped convolution first appeared in AlexNet and was designed to cope with limited hardware resources: it allowed parallel computation on two GPUs whose results were subsequently fused. Grouped convolution divides the input feature map into groups along the channel dimension and applies convolution to each group individually. The results of the grouped convolutions are then concatenated along the channel dimension to obtain the final output feature map. This operation has a lightweighting effect, reducing computational complexity.
Assume the input feature map has size H × W × N, the convolution kernels have size K × K, and M kernels are applied, so that (with stride 1 and unchanged spatial size) the output is H × W × M. The computation of the standard convolution is then H × W × M × K × K × N. When the input feature channels are divided into G groups, each group applies M/G kernels to N/G input channels, so the total computation becomes H × W × M × K × K × N / G, i.e., 1/G of that of the standard convolution.
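The 1/G reduction derived above can be checked numerically; the feature-map and kernel sizes below are illustrative choices.

```python
def conv_flops(h, w, n_in, m_out, k, groups=1):
    """Multiply-accumulate count of a (possibly grouped) convolution with
    stride 1 and unchanged spatial size: each of the m_out kernels sees
    only n_in/groups input channels."""
    return h * w * m_out * k * k * (n_in // groups)

std = conv_flops(56, 56, 64, 128, 3)             # standard convolution
grp = conv_flops(56, 56, 64, 128, 3, groups=4)   # 4-way grouped convolution
print(std // grp)                                 # → 4, i.e. 1/G of the cost
```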
Therefore, grouped convolution reduces the computational burden by dividing the input feature channels into multiple groups, which is particularly useful when hardware resources are limited. However, grouped convolution also introduces some information loss: because there is no direct interaction between channels in different groups, information across feature maps is inadequately fused [87][88][89]. When choosing the number of groups, a balance must therefore be struck between computational efficiency and model performance. ShuffleNetV1 [90] proposed a channel shuffle method to address the limited information exchange between groups in grouped convolution. To maintain recognition accuracy, a uniform shuffle is performed on the feature maps produced by the grouped convolution, as shown in Figure 10. The purpose of this channel shuffle operation is to ensure that the input to the next grouped convolution comes from different groups. ShuffleNet achieved a 13-fold speed improvement over AlexNet while maintaining accuracy. However, ShuffleNetV2 [91] pointed out that evaluating model performance solely by parameter count and floating-point operations (FLOPs) is inaccurate: based on experimental observations, a model's actual runtime depends not only on its computational operations but also on factors such as memory access, GPU parallelism, and file I/O. ShuffleNetV2 therefore proposed four guidelines for improving model efficiency: (1) use convolutions with the same number of input and output channels, which minimizes memory access and communication overhead; (2) reduce the use of grouped convolution, which, despite lowering the computational burden, isolates information between groups; (3) reduce network fragmentation, since a highly branched structure lowers the degree of parallelism; and (4) reduce element-wise operations, whose cost is non-negligible at runtime. In the ShuffleNetV2 unit, a 1×1 convolution with the same number of input and output channels is introduced to meet guideline (1), the use of grouped convolution is abandoned to comply with guideline (2), and the feature addition operation is replaced with channel concatenation (Concat) in line with guideline (4). These improvements enhance both runtime speed and accuracy: experimental results show that ShuffleNetV2 achieves a 63% speed improvement over ShuffleNetV1. In the latest research on ShuffleNet, a lightweight network called (2+1)D Distilled ShuffleNet is proposed in [92] for human action recognition using an unsupervised distillation learning paradigm. This network extracts knowledge from the teacher network through distillation, without the need for labeled data. On the UCF86 and HMDB4 datasets, (2+1)D Distilled ShuffleNet achieves better accuracy and inference runtime than other state-of-the-art distilled networks. Furthermore, reference [93] presents a lightweight garbage classification model known as the Garbage Classification Network (GCNet), which builds upon ShuffleNetV2 with three notable enhancements: the incorporation of a Parallel Mixed Attention Mechanism (PMAM), a novel activation function, and the application of transfer learning.
Experimental findings indicate that GCNet achieves an outstanding average accuracy of 97.9% on a custom dataset, an improvement of nearly 10% over ShuffleNetV2 with a similar number of model parameters. Furthermore, reference [94] proposes a re-identification network called ShuffleNet-Triplet for individual cow recognition. This network uses ShuffleNetV2 for feature extraction to reduce the number of parameters, and strengthens the network's ability to distinguish similar individuals by organically combining the triplet loss and the cross-entropy loss; BNNeck is also introduced to reduce conflicts between the two loss functions. Experimental results show that ShuffleNet-Triplet achieves a 6.88% improvement in average accuracy over the ShuffleNetV2 model.
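The channel shuffle operation used by ShuffleNet is commonly implemented as a reshape-transpose-reshape on the channel axis; a minimal numpy sketch, with a toy 8-channel input whose channels are labelled so the interleaving is visible:

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet-style channel shuffle: view the channel axis as
    (groups, channels_per_group), transpose the two, and flatten back,
    so the next grouped convolution receives channels from every group."""
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)

x = np.arange(8).reshape(1, 8, 1, 1)              # channels labelled 0..7
y = channel_shuffle(x, groups=2)
print(y.ravel().tolist())                          # → [0, 4, 1, 5, 2, 6, 3, 7]
```

After the shuffle, each contiguous block of channels mixes members of both original groups, which is exactly the cross-group information exchange the method is designed to provide.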

Depthwise separable convolution
Depthwise separable convolution is a convolutional operation that consists of two steps: depthwise convolution and pointwise convolution. This operation effectively reduces computational complexity and model parameters. In the depthwise convolution step, each channel of the input features is convolved separately, as shown in Figure 12(a). The purpose of this step is to capture features while preserving their spatial information. Next, in the pointwise convolution step, the output of the depthwise convolution is convolved with a 1×1 convolutional kernel, as shown in Figure 12(b). The goal of pointwise convolution is to fuse information from different channels by convolving the set of output feature maps from the depthwise convolution with a 1×1 kernel; this process generates the final output feature map. This combination is what makes the model lightweight. Assume an input feature map of height H and width W with M channels, and an output feature map of the same height and width with N channels. Standard convolution uses N kernels of size K×K×M, so its computation is K×K×M×N×H×W; the computation of the depthwise convolution is K×K×M×H×W, and that of the pointwise convolution is 1×1×M×N×H×W. The ratio C_r of the total computation of depthwise separable convolution to that of standard convolution is shown in Equation (4):

C_r = (K×K×M×H×W + M×N×H×W) / (K×K×M×N×H×W) = 1/N + 1/K²    (4)

This ratio C_r measures the computational reduction achieved by depthwise separable convolution compared to standard convolution. Similarly, the ratio of parameter counts between depthwise separable convolution and standard convolution is (K×K×M + M×N) / (K×K×M×N) = 1/N + 1/K². The MobileNet series is a collection of lightweight network models based on depthwise separable convolution. These models aim to reduce computational complexity and parameter count while maintaining good performance, particularly in tasks such as image classification. MobileNetV1 [95] was the first model in this series, which
replaced traditional convolutional operations with depthwise separable convolution, resulting in a significant reduction in computational complexity and parameter count. In the ImageNet classification task, MobileNetV1 achieved performance comparable to traditional network models such as GoogLeNet and VGG-16 while reducing the model parameters by nearly 30 times.
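The computational saving of depthwise separable convolution can be checked numerically from the ratio in Equation (4). A small sketch counting multiplications only (the tensor sizes are illustrative; additions and biases are ignored):

```python
def conv_costs(H, W, M, N, K):
    """Multiply counts for standard vs. depthwise separable convolution,
    assuming a same-size output feature map as in the derivation above."""
    standard  = K * K * M * N * H * W    # standard convolution
    depthwise = K * K * M * H * W        # per-channel spatial convolution
    pointwise = M * N * H * W            # 1x1 channel-fusion convolution
    return standard, depthwise + pointwise

std, sep = conv_costs(H=112, W=112, M=32, N=64, K=3)
ratio = sep / std                        # equals 1/N + 1/K**2
print(ratio)  # ≈ 0.127, i.e. roughly an 8x reduction for 3x3 kernels
```

For 3×3 kernels the ratio is dominated by the 1/K² term, which is why MobileNet-style networks achieve close to a nine-fold reduction in multiplies regardless of the channel counts.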

Analysis and discussion
From the performance comparison of various lightweight networks summarized below (Table 1), it can be seen that GhostNet exhibits the best overall performance among these lightweight neural networks. It achieves the highest accuracy while maintaining relatively low computational complexity.
The SqueezeNet series has higher computational complexity, poorer scalability, and relatively lower recognition accuracy, but a smaller parameter count. The ShuffleNet series has lower computational complexity and uses channel shuffling to better exploit the information in each channel, thereby improving network accuracy. The MobileNet series, although not as impressive as other lightweight neural networks in experimental data, has seen its concept of depthwise separable convolution widely applied in many large-scale networks that require lightweight designs [100][101][102]. Each classic lightweight network model has its unique advantages, and there is currently no single network that perfectly balances speed and accuracy. This points to a direction for future research: exploring lightweight network models that strike a better balance between speed and accuracy [103][104][105]. Table 2 summarizes the improvements and performance of these four classic network model families in recent years. However, it can be observed that the corresponding papers mainly focus on applications in different domains, with less emphasis on innovative network structure design or lightweight module improvements. From the "Improvements" column in Table 2, it can be seen that some papers attempt to improve accuracy by introducing different attention mechanisms. However, this often increases the complexity of the network model, making it challenging to achieve lightweight goals. Other papers replace the backbone network of large models with these four classic lightweight networks to obtain lightweight models, but this often significantly reduces recognition accuracy.
However, only a few papers [106][107][108][109][110][111][112][113][114][115][116][117][118][119][120][121] have conducted further optimization research based on lightweight network architectures. For example, the MicroNet model proposed in [121] introduces the concept of micro-factorized convolution. This method decomposes the convolution matrix into low-rank matrices to integrate sparse connections into the convolution, yielding significant performance improvements in low-FLOP regimes and surpassing existing techniques. Therefore, to promote the development of lightweight networks, research should focus more on innovative network structures and improvements to lightweight modules. Such efforts will reduce model complexity while maintaining good performance, and further drive the advancement of lightweight network computation.
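To illustrate the low-rank idea behind micro-factorized convolution, the sketch below approximates a dense pointwise-convolution matrix with two thin factors. Truncated SVD is used here purely as an illustrative stand-in; MicroNet itself learns the factorized layers directly, and the channel counts and rank below are assumptions.

```python
import numpy as np

# Replace the dense C_out x C_in pointwise-convolution matrix W with two
# thin factors A (C_out x R) and B (R x C_in), with rank R much smaller
# than the channel counts.
C_in, C_out, R = 256, 256, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((C_out, C_in))

# Best rank-R approximation via truncated SVD.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :R] * S[:R]          # C_out x R
B = Vt[:R, :]                 # R x C_in

dense_params = C_out * C_in
lowrank_params = C_out * R + R * C_in
print(lowrank_params / dense_params)   # 0.125, i.e. 8x fewer parameters
```

Applying B and then A costs R·(C_in + C_out) multiplies per spatial position instead of C_in·C_out, which is where the low-FLOP gain comes from.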

Table 2: Improvement and performance of the latest lightweight networks.

Models Improvements Performance
Modified SqueezeNet [122]: Improved the SqueezeNet architecture by reducing the number of Fire modules and increasing the number of pooling layers. Improved precision by 28% and recall by 20% compared to the original SqueezeNet model.
OSQN-DNN [78]: 1) Use the Coyote Optimization Algorithm (COA) to optimize the hyperparameters involved in the SqueezeNet model.
2) Use DNN models as classifiers to assign appropriate class labels.
Compared to the existing high-caliber DLPSO algorithm, the accuracy is 1% higher and the runtime is reduced by 0.13 s.
Compared with SqueezeNet, the test accuracy is improved by 6.2% and the computation time is reduced by 1 s.
CBAM-GhostNet-SSD [86]: 1) Introduce the Ghost module into the SSD network and add the ECA attention mechanism.
2) Add a CBAM module to the network.
Compared to Ghost-SSD, mAP is up 1% and FPS is up 3 frames/s.
2) Replace the normal convolutional blocks of PANet in YOLOv4 with depthwise separable convolutional blocks.
Nearly 4% improvement in mAP for identifying nursery saplings.
2) Introduced an improved CA attention mechanism (p-CA). Compared to the original model GhostNet, it reduces the amount of computation by 44% and the number of parameters by 33%, and improves the FPS by 26%.
2) Knowledge extraction from two teacher networks.
Compared to the ResNet-18 backbone, the accuracy is improved by 0.7% on the UCF101 dataset, the number of parameters is reduced by almost 30%, and the amount of computation is reduced by almost 60%.
Compared with the original model ShuffleNetV2, the accuracy is improved by 4.5%, and the number of parameters and computation are reduced by nearly 8%.

ShuffleNet-Triplet [94]
The triplet loss and the cross-entropy loss are calculated separately using the BNNeck structure, and then the two are combined.
Nearly 3% improvement in recognition of individual cows compared to the original model ShuffleNetV2.
The accuracy is 1.3% higher than MobileNetV3 and the computation is reduced by 17.4%.
2) Combine center loss and softmax loss to optimize model parameters.
Accuracy is about 3% higher than MobileNetV1 on the RAF-DB dataset.
On the BACH challenge Part B WSI segmentation dataset, the average accuracy is improved by 1% compared to MobileNetV3.

Challenges and Prospects
Currently, intelligent mobile devices are moving in the direction of edge computing and lightweight development. A key research focus at present is how to minimize model latency and storage space while maintaining neural network model accuracy to the greatest extent possible.
Most existing model pruning algorithms eliminate redundant connections or neurons in the network. However, these low-level pruning methods produce unstructured sparsity, and the resulting irregular memory access patterns during computation impede further acceleration of the network [123][124][125].
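A minimal NumPy sketch of the magnitude-based unstructured pruning described above; the matrix size and sparsity level are illustrative. The surviving weights sit at irregular positions, which is the source of the scattered memory accesses mentioned in the text.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Unstructured magnitude pruning: zero out the given fraction of
    weights with the smallest absolute values, keeping the rest in place."""
    k = int(weights.size * sparsity)
    threshold = np.sort(np.abs(weights), axis=None)[k - 1] if k > 0 else -np.inf
    mask = np.abs(weights) > threshold   # True where a weight survives
    return weights * mask, mask

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 4))
pruned, mask = magnitude_prune(w, sparsity=0.75)
print(int(mask.sum()))   # 4 of the 16 weights survive
```

The pruned tensor keeps its original dense shape, so the saving only materializes with sparse storage formats or hardware support, which is exactly the limitation noted in [123][124][125].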
In contrast, structurally pruned networks have smaller model sizes, faster execution speeds, and reduced storage requirements, making them more suitable for deployment on computationally limited mobile devices. Among structured methods, convolutional kernel pruning is one of the hot research topics. Most convolutional kernel pruning algorithms involve three classic steps: pre-training, pruning, and fine-tuning. However, in [126], evaluations of various algorithms on multiple network structures revealed that training small target models from random initialization can achieve identical or even better performance than the classical three-step pruning pipeline. Additionally, training models from scratch can achieve performance equivalent to or better than fine-tuning pruned models. This indicates that in pruning algorithms it is more important to find suitable network structures than to decide how to preserve important weights within existing structures. Furthermore, most current convolutional kernel pruning algorithms prune along a single dimension such as depth, width, or resolution, which may cause excessive loss in that dimension and reduce accuracy, while the achieved compression rate may not be high. Exploring pruning along multiple dimensions could yield better results and is an avenue worth investigating in the future [127]. Therefore, future research should focus on designing appropriate network architectures and exploring multidimensional pruning methods that achieve highly efficient pruning while maintaining model accuracy. This will facilitate the deployment of lightweight neural network models on computationally limited mobile devices.
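The pruning step of the classic three-step pipeline can be sketched as follows for convolutional kernel (filter) pruning. Ranking filters by L1 norm is one common criterion; the tensor shapes and keep ratio below are illustrative assumptions.

```python
import numpy as np

def prune_filters_l1(conv_weight, keep_ratio):
    """Structured pruning step: rank the output filters of a conv layer
    (shape N x M x K x K) by L1 norm and keep the strongest ones whole.
    The result is a smaller dense tensor, so no sparse indexing is needed."""
    norms = np.abs(conv_weight).sum(axis=(1, 2, 3))   # one score per filter
    n_keep = max(1, int(round(conv_weight.shape[0] * keep_ratio)))
    keep = np.sort(np.argsort(norms)[::-1][:n_keep])  # indices, original order
    return conv_weight[keep], keep

rng = np.random.default_rng(2)
w = rng.standard_normal((64, 32, 3, 3))               # 64 filters
w_small, kept = prune_filters_l1(w, keep_ratio=0.5)
print(w_small.shape)   # (32, 32, 3, 3): a regular, dense, smaller layer
```

Because entire filters are removed, the next layer's corresponding input channels can also be dropped, which is how structured pruning translates directly into speed and storage gains on general-purpose hardware.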
However, some challenges remain in pruning algorithms. For example, the evaluation systems and metrics used to assess the importance of weights and the performance of pruning algorithms are often oversimplified. Proposing effective methods for measuring the impact of pruning on models therefore remains a challenge. Researching these challenges and proposing algorithms with superior performance holds great potential for development. Additionally, exploring algorithms that do not rely on manually designed hyperparameters is a promising direction; research in this area is still limited but is of significant importance for the advancement of pruning algorithms.
Significant progress has been made in network architecture design, primarily focusing on designing lighter modules and optimizing convolutional structures. Currently, there are two main approaches to designing lightweight models: (1) improving existing lightweight modules based on specific requirements; (2) designing traditional modules that meet the desired criteria, then replacing the convolutions in these modules with lightweight convolutions and adjusting the structural relationships between modules using existing lightweight functional structures or tools such as activation functions. Most lightweight networks adopt the first approach, which has achieved noticeable lightweighting results. However, this approach has reached a plateau in the degree of lightweighting it can deliver, making further breakthroughs difficult. In contrast, the second approach is more challenging because it relies heavily on the designer's expertise and prior knowledge of deep learning, such as how to design structures that in principle resemble the sparse connections between human neurons, and how to control the influence of prior knowledge while improving metrics like latency, computation speed, and storage space by altering the network structure.
Currently, reinforcement learning-based neural network architecture search is the mainstream approach for network architecture design. This method uses a reinforcement learning controller to search for and generate network structures within the search space, eliminating the need for extensive manual effort, which is a key reason for its rapid development. However, reinforcement learning-based architecture search methods tend to focus too heavily on improving model accuracy while neglecting the limitations of the underlying hardware. The resulting models often have high hardware requirements and are challenging to deploy on embedded devices. Therefore, lightweight network architecture design still faces challenges. Future research directions include finding a balance between model accuracy and hardware requirements and considering the limitations of the underlying hardware during the design process. It is essential to develop methods that optimize both model performance and compatibility with resource-constrained devices.
Indeed, whether through model pruning or network architecture design, the goal is not only to reduce the complexity of convolutional neural networks but also to maintain or even improve the original accuracy. With the rapid development of deep learning, emerging technologies such as Graph Neural Networks (GNNs) [128][129][130] and the integration of neural networks with Transformers (e.g., Vision Transformers, ViTs) [131][132][133] have gradually matured and gained recognition in the academic community. However, how to lightweight these models and apply them to real-world industrial applications, while ensuring post-deployment security, remains a significant challenge for model compression and acceleration techniques [134][135].

Conclusion
This paper summarizes methods of model pruning and network architecture design for lightweight neural networks. Regarding model pruning, structured and unstructured pruning algorithms are compared and analyzed, highlighting their characteristics. Generally, unstructured pruning exhibits irregularity and requires specific hardware to realize its advantages, while structured pruning offers more pruning options and is more easily applicable to general-purpose hardware; structured pruning algorithms are therefore the more common choice at present. In terms of network architecture design, this paper provides an overview of four lightweight neural networks: SqueezeNet, GhostNet, ShuffleNet, and MobileNet. The mathematical principles behind their lightweighting are explained, and their performance on the ImageNet dataset is compared and analyzed. Based on existing work, it can be observed that model pruning needs a comprehensive evaluation metric system to measure algorithm performance effectively, while network architecture design requires further exploration of methods that accelerate computation and reduce storage while preserving the accuracy of the original model as much as possible.

Figure 4: Illustration of the convolutional kernel pruning process.

Figure 5: Illustration of the channel pruning process.

Figure 6: Illustration of the network slimming process.

Figure 7: Fire module schematic [77].

SqueezeNext combines the design principles of SqueezeNet and tensor decomposition. By decomposing the convolutional operation into an additive step followed by separable convolutional operations, SqueezeNext achieves a 3.2-times speedup over SqueezeNet without sacrificing accuracy. Reference [78] presents an optimized SqueezeNet-based network model (OSQN-DNN) for unmanned aerial vehicle (UAV) aerial image classification. This model uses OSQN as the feature extractor and applies the Coyote Optimization Algorithm (COA) [79] to optimize hyperparameter selection in the SqueezeNet model, significantly improving overall classification performance. Experimental results demonstrate that the OSQN-DNN model achieves better accuracy and inference runtime than SqueezeNet on the benchmark UCM dataset. In reference [80], an enhanced version of the SqueezeNet convolutional neural network is introduced. This improved model incorporates data preprocessing techniques such as data normalization and the Synthetic Minority Over-sampling Technique (SMOTE). Furthermore, the continuous wavelet transform is used to generate spectrograms, which are then employed for training and testing the modified SqueezeNet model. The results demonstrate that this enhanced SqueezeNet model achieves a remarkable accuracy of 90%. In reference [81], a novel approach combining the Aquila Sine Cosine Algorithm (ASCA) with the SqueezeNet model is presented. This integration aims to reduce both the training time and the computational complexity of the detection process. By leveraging the ASCA technique, the weights of the deep convolutional neural network (DCNN) and SqueezeNet are updated, resulting in improved efficiency. Experimental results demonstrate the superior performance of this combined model.

Ghost module

This paper visualizes the process by which neural networks extract features from data, as shown in Figure 8. During the feature extraction process, many features are similar.

Figure 8: Visualization of neural network feature maps.

Let us analyze the principle behind Ghost convolution's reduction in computation from a theoretical perspective, as shown in Figure 9. Assume an input feature map with c channels. We use m ordinary convolutional kernels of size k×k to generate m intermediate feature maps. Each intermediate feature map is then transformed into s "ghost" feature maps through a series of cheap linear operations; combined with the identity mappings of the m intermediate feature maps, this yields n = m×s output feature maps. In the Ghost module, the number of identity mappings is m = n/s, and the number of cheap linear operations is m×(s−1).

Figure 9: Ghost convolution principle.

Next, we further analyze the memory and computation savings brought by Ghost convolution through mathematical derivation, as shown in Equation (3). Theoretical analysis shows that replacing ordinary convolution with Ghost convolution reduces the number of convolutional parameters while producing the same number of feature maps, cutting the model parameter count by approximately a factor of s.

r_c = (n×c×k×k) / ((n/s)×c×k×k + (s−1)×(n/s)×d×d) ≈ (s×c) / (c + s − 1) ≈ s    (3)

where d×d is the kernel size of the cheap linear operation (of similar magnitude to k×k), and the approximations hold because c ≫ s.
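The parameter counts behind this s-fold reduction can be checked with a short sketch; the cheap-operation kernel size d = 3 and the layer sizes below are illustrative assumptions.

```python
def ghost_params(c, n, k, s, d=3):
    """Parameter counts for ordinary vs. Ghost convolution (bias-free).
    c: input channels, n: output maps, k: primary kernel size,
    s: ghost factor, d: kernel size of the cheap depthwise operation."""
    m = n // s                                    # intrinsic feature maps
    ordinary = n * c * k * k                      # one k x k kernel per output
    ghost = m * c * k * k + m * (s - 1) * d * d   # primary conv + cheap ops
    return ordinary, ghost

ordinary, ghost = ghost_params(c=256, n=256, k=3, s=2)
print(ordinary / ghost)   # close to s = 2, as the s-fold claim predicts
```

The cheap operations act on one channel at a time, so their d×d cost is independent of c; for large c the primary convolution dominates and the compression ratio approaches s.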
(3) Reduce network branches: branch operations in the network increase computational and communication overhead. (4) Minimize element-wise operations.

Figure 10: Channel shuffle.

ShuffleNetV2 improves on ShuffleNetV1 based on the four guidelines mentioned earlier. Its structure is shown in Figure 11. In the residual branch of ShuffleNetV2, a 1×1 convolution with an equal number of input and output
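The channel shuffle of Figure 10 can be written as a reshape, a transpose, and a reshape; a minimal NumPy sketch (the batch and spatial sizes are illustrative):

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet channel shuffle: view the (N, C, H, W) channel axis as
    (groups, C // groups), transpose those two factors, and flatten back,
    so the next grouped convolution sees channels from every group."""
    n, c, h, w = x.shape
    assert c % groups == 0
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))

x = np.arange(6).reshape(1, 6, 1, 1)      # channels 0..5, two groups of 3
y = channel_shuffle(x, groups=2)
print(y.ravel().tolist())                 # [0, 3, 1, 4, 2, 5]
```

Because it is pure data movement with no learned parameters, the shuffle adds information flow between groups at almost no computational cost, which is the key to making stacked grouped convolutions accurate.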
MobileNetV2 [96] further improved upon MobileNetV1 by introducing linear bottleneck structures and inverted residual blocks. The linear bottleneck structure combines depthwise convolution and pointwise convolution, enhancing both speed and accuracy, while the inverted residual structure improves information flow and feature propagation. MobileNetV3 incorporates neural architecture search (NAS) to automatically obtain optimal network parameters and introduces the SE attention mechanism to enhance feature interaction between channels. This architecture demonstrated better performance than MobileNetV1 and MobileNetV2 in the ImageNet classification task. The latest research on the MobileNet series includes Mobile-Former [97], which combines MobileNet with the Transformer architecture to create a lightweight framework. By leveraging the advantages of MobileNet in local processing and of the Transformer in global interaction through a bidirectional bridge connection, this structure achieves bidirectional fusion of local and global features. In the ImageNet classification task, it achieved a 3.3% improvement in accuracy over MobileNetV1 while reducing computational complexity by 17%. A-MobileNet [98] introduces attention modules and parameter optimizations to the MobileNetV1 model, demonstrating better performance on the FERPlus and RAF-DB datasets than other models. Additionally, BM-Net [99] is a lightweight network composed of a bilinear structure and MobileNetV3, specifically designed for analyzing breast cancer whole-slide images (WSI). Experimental results show its significant potential in breast cancer WSI detection.

Table 1: Comparison table of four lightweight network model families.