SRCNN: Stacked-Residual Convolutional Neural Network for Improving Human Activity Classification Based on Micro-Doppler Signatures of FMCW Radar
Abstract
Current methods for daily human activity classification primarily rely on optical images from cameras or wearable sensors. Despite their high detection reliability, camera-based approaches suffer from several drawbacks, such as low-light conditions, limited range, and privacy concerns. To address these limitations, this article proposes the use of a frequency-modulated continuous wave radar sensor for activity recognition. A stacked-residual convolutional neural network (SRCNN) is introduced to classify daily human activities based on the micro-Doppler features of returned radar signals. The model employs a two-layer stacked-residual structure to reuse former features, thereby improving the classification accuracy. The model is fine-tuned with different hyperparameters to find a trade-off between classification accuracy and inference time. Evaluations are conducted through training and testing on both simulated and measured datasets. As a result, the SRCNN model with six stacked-residual blocks and 64 filters achieves the best performance, with accuracies exceeding 95% and 99% at 0 dB and 10 dB, respectively. Remarkably, the proposed model outperforms several state-of-the-art CNN models in terms of classification accuracy and execution time on the same datasets.
I. Introduction
In the past few years, the immediate detection and accurate classification of human activities have become critical health concerns. Timely detection of abnormal, potentially dangerous activities, such as falls due to stroke or myocardial infarction, is especially crucial for the elderly and people with special needs [1]. Such activities can be monitored using wearable devices, cameras, and modern sensors. Although wearable devices like smartphones or smartwatches offer significant benefits, they can be inconvenient for users to wear continuously. In addition, vision-based devices may raise privacy concerns [2] and are ineffective in dark or foggy conditions. To address these issues, radar-based non-wearable devices offer a promising solution for non-contact monitoring of human movements, as they perform well even in low-visibility environments.
When individuals engage in daily activities, the simultaneous movements of different body parts, such as swinging the arms and legs, rotating the body, and bending, are categorized as micromotions, distinct from the overall translational motion of the body. These micromotions create the micro-Doppler (m-D) effect, an additional modulation in the Doppler frequency shift observed during movement. In the context of human motion, this effect arises from limb movements, producing unique m-D signatures that correspond to specific activities, derived from the intricate motions of the torso and the limbs [3]. Researchers have extensively utilized and studied these m-D signatures, often represented as spectrograms, for the classification of daily activities and gait analysis [4, 5].
About a decade ago, the detection and classification of human activities using radar technology relied on machine learning (ML) algorithms that manually extracted features from m-D signatures in received radar signals [3]. Although these approaches achieved reasonable results, they often suffered from variable classification accuracy, strongly influenced by the expertise of the human analyst.
In recent years, deep learning has emerged as a leading technology for activity monitoring and recognition. Convolutional neural networks (CNNs), capable of automatic feature extraction, offer potential solutions to overcome the limitations of ML-based methods and achieve higher classification accuracy. For instance, in [6], two conventional classifiers (a support vector machine and k-nearest neighbors) and a well-known CNN (GoogLeNet) were utilized to recognize activities based on m-D signatures captured by a frequency-modulated continuous wave (FMCW) radar. However, these models exhibited high false alarm rates, with error rates of 21.75%, 22.85%, and 25.3%, respectively. In [7], various pre-trained models, such as AlexNet and VGGNet, as well as a custom-designed CNN, were explored for extracting and categorizing human activities. VGGNet, known for its deep network structure, achieved the highest accuracy of 95%. However, this improvement came at the cost of increased processing time and complexity.
Despite achieving higher recognition rates compared to ML-based classifiers, transfer learning has notable drawbacks. First, the model’s structure relies solely on the design of the chosen CNN and remains unadjusted to suit the input dataset. Consequently, the architecture might not be optimal for the given task, leading to potential issues of overfitting or underfitting. Second, the pre-trained model’s optimal weights may not be suitable for the new context and require updating with new training data. This can result in suboptimal performance for new data types, such as the m-D spectrogram of FMCW radar, and may restrict the model’s ability to generalize across different contexts. Therefore, carefully considering the chosen CNN architecture and updating the pre-trained model weights with the appropriate training data are necessary to ensure optimal performance in new contexts.
In recent years, custom-designed models have emerged as a novel approach for researchers. This approach involves adjusting the standard CNN structure, constructing lightweight models with simple forms, and designing improved models using contemporary techniques. Developing a customized deep-learning model necessitates balancing multiple factors, among which accuracy, processing speed, and noise sensitivity are particularly prioritized. Custom-designed models have gained popularity in various fields, such as signal modulation classification [8, 9], object classification [10, 11], text classification [12], and human activity classification [7, 13, 14]. However, their use in detecting human activities based on m-D spectrogram images from FMCW radar remains limited. To our knowledge, [7], [13], and [14] represent the advancements in this area, with [13] being the most recent. In [7] and [14], two custom-designed deep CNNs (DCNNs) were proposed to effectively address feature extraction and recognition issues for human activities using spectrogram images. However, these models showed lower accuracy than pre-trained models due to their simplistic structures, consisting only of a few convolutional layers connected directly to activation layers. In [13], a dense inception neural network (DINN) was introduced to classify 11 human indoor activities. DINN modified the architectures of Inception modules and utilized skip connections from DenseNet to address the gradient vanishing issues typical of CNNs with consecutive convolution and activation layers. Experimental results on a simulated dataset demonstrated that this custom-design-based approach achieved the best balance between classification accuracy and learnable parameters compared to ML-based methods. However, DINN remains sensitive to noise. Therefore, this study proposes a novel customized CNN, the stacked-residual convolutional neural network (SRCNN), aimed at achieving high accuracy, low noise sensitivity, and consistent computation times for identifying human activities based on m-D signatures. The main contributions of this work can be succinctly summarized as follows.
First, the architecture of SRCNN is built from multiple advanced feature-extracting blocks known as stacked-residual blocks (S-R blocks), which incorporate novel designs inspired by the im-res block, the addition operation, and skip connections. These blocks utilize two parallel convolutional layers of different sizes (3 × 3 and 5 × 5), enabling the extraction of diverse information from distinct spatial regions of the same input feature map. This approach can improve the classification accuracy of the network compared to conventional fixed-size filters. Moreover, the use of skip connections facilitates the efficient reuse of earlier features that might otherwise be attenuated through the preceding filtering stages. This strategy enhances the model's accuracy while retaining a reasonable level of complexity and computational cost.
Second, fine-tuning the hyperparameters on two separate datasets (a noise-added dataset and a real dataset) demonstrated the efficiency of S-R blocks for multi-class action classification. The proposed model achieved the best classification results of 95% at −4 dB signal-to-noise ratio (SNR), 99% at 10 dB SNR, and 97% with the real dataset. SRCNN also outperformed other available CNNs in terms of accuracy and time consumption.
The rest of this paper is structured as follows: Section II describes the FMCW radar operation and the dataset collection. The proposed SRCNN structure is presented in Section III. Experimental and comparative results are discussed in Section IV. The final section provides the conclusion.
II. FMCW Radar and Dataset Description
1. Principles of the FMCW Radar
The diagram of an FMCW radar system is illustrated in Fig. 1, consisting of four main components. The waveform generator (WG) generates a control signal for a voltage-controlled oscillator (VCO) to produce an FMCW radar signal whose frequency varies over time.
The signal is then divided into two paths: the first goes to the transmit antenna for radiation into free space, where targets may exist, while the other is fed to the mixer at the receiver. The transmitted signal of the FMCW radar can be represented as follows [15]:
$$S(t) = A_t \cos\left[2\pi\left(f_0 t + \frac{\Delta B}{2\tau} t^2\right)\right], \quad 0 \le t \le \tau,$$
where ΔB is the sweep bandwidth, τ is the duration of a chirp, A_t is the amplitude, and f_0 is the carrier frequency of the transmitted signal.
When the transmitted signal S(t) hits a target, the signal is reflected and propagates back to the receiving antenna of the radar. The received signal is expressed as follows [15]:
$$R(t) = A_r \cos\left[2\pi\left(f_0 (t - t_d) + \frac{\Delta B}{2\tau} (t - t_d)^2\right)\right],$$
where A_r is the amplitude of the received signal and t_d = 2d/c is the round-trip delay of a target at range d, with c denoting the speed of light.
At the receiver, the R(t) signal is mixed with the in-phase and quadrature (IQ) components of a copy of the transmitted signal at the mixer and passes through a low-pass filter (LPF) to obtain a baseband IQ signal for further processing [15]:
$$B(t) = \frac{A_t A_r}{2} \exp\left[j2\pi\left(f_b t + f_0 t_d\right)\right],$$
where f_b = (ΔB/τ)t_d is the beat frequency induced by the target's round-trip delay t_d.
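To make the signal model concrete, the following minimal NumPy sketch (our illustration, not code from the paper) simulates the complex beat signal of a single point target using the chirp parameters of the simulated dataset described later; the target range and amplitude are hypothetical.

```python
import numpy as np

# Chirp parameters matching the simulated dataset; the target range is hypothetical
f0 = 24e9            # carrier frequency (Hz)
delta_B = 400e6      # sweep bandwidth (Hz)
tau = 1e-3           # chirp duration (s)
n_samples = 128      # samples per chirp
c = 3e8              # speed of light (m/s)

d = 3.0                          # assumed target range (m)
t_d = 2 * d / c                  # round-trip delay (s)
f_b = (delta_B / tau) * t_d      # beat frequency (Hz), here 8 kHz

t = np.linspace(0, tau, n_samples, endpoint=False)  # fast-time axis
# Complex baseband beat signal after mixing and low-pass filtering
B = 0.5 * np.exp(1j * 2 * np.pi * (f_b * t + f0 * t_d))
```

A range FFT over such fast-time samples concentrates the target's energy in the range bin corresponding to f_b.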
If a person is the radar target, their activities can be classified based on the m-D signatures obtained from analyzing the m-D spectrogram of the IQ signal.
2. Micro-Doppler Effect and Data Processing
When people move, the simultaneous movement of different parts, such as swinging arms and legs, is defined as micromotions, in contrast to the translational motion of the body. These micromotions cause the m-D effect, which is considered an additional modulation in the Doppler frequency shift of the entire movement. Consequently, the m-D signatures in the time-frequency spectrum represent micromotions and, therefore, human activities. As such, time-frequency analysis is widely used to identify and classify these activities. Fig. 2 illustrates the procedure for processing the IQ signal of an FMCW radar into the time-frequency spectrogram. Specifically, the IQ signal is transformed into a 2D data matrix, with the first dimension representing fast-time bins and the second dimension representing slow-time bins. The range-time matrix is then obtained by applying the fast Fourier transform (FFT) along the fast-time direction of the matrix. A moving target indication filter with a specific cutoff frequency is applied to the range-time matrix to suppress returns from non-moving objects. Next, the short-time Fourier transform (STFT) algorithm is employed using a sequence of FFTs with short-sized windows and overlapping slides over the entire duration of the new range-time matrix to obtain a time-frequency spectrogram of the input IQ signal. The STFT spectrogram is given as follows [16]:
$$\mathrm{STFT}\{x(t)\}(\tau, f) = \int_{-\infty}^{+\infty} x(t)\, w(t-\tau)\, e^{-j2\pi f t}\, dt,$$
where x(t) is the input signal and w(t−τ) is the window function. The squared absolute value of the complex STFT is defined as the spectrogram, which contains the m-D signatures and can be used to classify human activities.
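A minimal NumPy/SciPy sketch of this processing chain is given below; it is our illustration under stated assumptions, with a simple two-pulse canceller standing in for the unspecified moving target indication filter and with illustrative window parameters.

```python
import numpy as np
from scipy.signal import stft

def iq_to_spectrogram(iq, prf, nperseg=128, noverlap=120):
    """iq: 2D complex array [fast-time samples x slow-time chirps];
    prf: chirp repetition frequency (Hz), i.e., the slow-time sampling rate."""
    # Range FFT along the fast-time axis -> range-time matrix
    range_time = np.fft.fft(iq, axis=0)
    # Two-pulse canceller as a simple stand-in for the MTI filter
    mti = range_time[:, 1:] - range_time[:, :-1]
    # Collapse the range bins of interest, then STFT over slow time
    slow_time = mti.sum(axis=0)
    f, t, Z = stft(slow_time, fs=prf, nperseg=nperseg,
                   noverlap=noverlap, return_onesided=False)
    # Squared magnitude of the complex STFT is the m-D spectrogram
    return f, t, np.fft.fftshift(np.abs(Z) ** 2, axes=0)
```

In practice, the resulting spectrogram would then be log-scaled and resized (here, to 306 × 306) before being fed to the classifier.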
3. Dataset Description
In this study, two datasets are used to evaluate the performance of the proposed SRCNN in classifying human activities based on m-D signatures. The first dataset is a noise-added dataset collected using the SimHumalator software [17]. The second is a real dataset published by the University of Glasgow [18].
3.1 Simulated dataset
The noise-added dataset was recorded using an FMCW radar with a carrier frequency of 24 GHz (K-band). The main parameters of the radar are configured with a chirp bandwidth of 400 MHz, a chirp duration of 1 ms, and 128 samples per chirp. The radar is positioned one meter above the ground, and the distance from the objects to the radar is either 3 or 5 meters. Eleven different actions are performed and repeated 60 times at various aspect angles (−90°, −45°, 0°, 45°, 90°). The duration of each execution ranges from 8 to 15 seconds for each specific type of action. White Gaussian noise is added to the dataset at different SNR values ranging from −15 to 10 dB to simulate realistic conditions and make the classification task more challenging.
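The paper does not detail its noise-injection routine; a common approach, sketched below, scales complex white Gaussian noise to the desired SNR before adding it to the IQ samples.

```python
import numpy as np

def add_awgn(iq: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    """Add complex white Gaussian noise to an IQ signal at a target SNR (dB)."""
    rng = np.random.default_rng(seed)
    sig_power = np.mean(np.abs(iq) ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = np.sqrt(noise_power / 2) * (rng.standard_normal(iq.shape)
                                        + 1j * rng.standard_normal(iq.shape))
    return iq + noise

# Example: six SNR levels spanning -15 to 10 dB (e.g., in 5 dB steps)
# noisy_copies = [add_awgn(iq, snr) for snr in (-15, -10, -5, 0, 5, 10)]
```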
3.2 Real dataset
The real dataset was obtained using an FMCW radar sensor with a carrier frequency of 5.8 GHz (C-band), while the remaining parameters were configured in the same way as in the previous dataset. It comprises a total of 1,632 measurements divided into six groups of indoor actions: walking, drinking, falling, sitting, standing, and picking.
3.3 Data preprocessing
The measurement data obtained from both datasets are in the form of IQ signals and are preprocessed according to the flowchart in Fig. 2. As a result, the noise-added dataset includes 39,600 spectrogram images (11 actions × 2 distances × 5 aspect angles × 60 iterations × 6 noise levels), while 1,632 spectrograms were obtained for the real dataset. Fig. 3 presents the spectrogram images of "walking" and "walk to fall" in the noise-added dataset. At an SNR of −15 dB, the features are almost completely submerged in the noise background, making them difficult to observe and distinguish. At higher SNR levels, the spectral characteristics of the activities become clearer. On closer observation, the m-D frequency fluctuation of the "walking" activity spans only about −200 Hz to 200 Hz (Fig. 3(a)–3(f)), while for the "walk to fall" activity, it spans −400 Hz to 400 Hz (Fig. 3(g)–3(l)). This can be explained by the fact that, when falling, human limbs instinctively swing far more vigorously and at a much higher speed than the torso moves, leading to a sharp increase in m-D frequency fluctuations. Fig. 4 shows the spectrograms of the different activities in the real dataset.
III. Proposed SRCNN-based Activity Classification
In this section, we propose a novel SRCNN to classify human activities based on the m-D signatures of an FMCW radar. The overall structure of the proposed model (as presented in Fig. 5(a)) consists of three blocks: an input block, six S-R blocks, and an output block.
The SRCNN starts with the input block, which consists of an input layer and a conv-unit arranged sequentially. Specifically, the size of the input layer is 306 × 306 × 1, matching the size of the input spectrogram images. The conv-unit consists of a convolutional (conv) layer, a batch normalization (norm) layer, and a rectified linear unit (ReLU) activation function. In detail, the convolutional layer uses 64 filters to create 64 corresponding feature maps. The convolution operation is given as follows:
$$y(m, n) = \sum_{j}\sum_{k} x(m - j,\, n - k)\, c(j, k) + b,$$
where x is the input, c is the convolution coefficient (filter kernel), and b is the bias. The norm layer follows each conv layer to speed up the network learning process, as represented by the following equation:
$$\hat{y}^{(i)} = \frac{y^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}},$$
where y^(i) is the output value of the conv layer, normalized by the mean μ and variance σ² computed over a mini-batch and each input channel, with ε = 10⁻⁵. The ReLU function is used because its fast convergence reduces the computation cost.
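As an illustration, a conv-unit can be sketched in PyTorch (the paper does not state its implementation framework; layer settings follow the description above, and 'same' padding is our assumption):

```python
import torch.nn as nn

def conv_unit(in_ch: int, out_ch: int, kernel_size: int = 3) -> nn.Sequential:
    """Conv layer -> batch normalization (eps = 1e-5) -> ReLU activation."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm2d(out_ch, eps=1e-5),
        nn.ReLU(inplace=True),
    )
```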
As shown in Fig. 5(b), the S-R block begins with an im-res block, which uses two conv-units with different filter sizes of 3 × 3 and 5 × 5 arranged in a parallel structure (Fig. 5(c)). This design allows the im-res block to extract more diverse information from two separate spatial regions within the same input feature map. In addition, compared with conventional fixed-size filters or with using more than two conv-units, this design maintains an optimal balance between classification accuracy and computational complexity in this sub-block. The output of the im-res block is combined by an addition operation, which sums the feature maps from the two preceding conv-units. This operation can be expressed as follows:
$$Z_{add1} = C_{3\times3}(I/2) \oplus C_{5\times5}(I/2),$$
where ⊕ and C denote the addition operation and the convolution operation, respectively, Z_add1 is the output of the addition layer and of the im-res block, and I/2 is the input to the im-res block.
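Under this description, the im-res block can be sketched as follows, reusing the conv_unit helper above (our sketch; equal-size branch outputs are assumed so the addition is well defined):

```python
import torch
import torch.nn as nn

class ImResBlock(nn.Module):
    """Two parallel conv-units (3x3 and 5x5) combined by element-wise addition."""
    def __init__(self, channels: int):
        super().__init__()
        self.branch3 = conv_unit(channels, channels, kernel_size=3)
        self.branch5 = conv_unit(channels, channels, kernel_size=5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Z_add1 = C_3x3(I/2) (+) C_5x5(I/2)
        return self.branch3(x) + self.branch5(x)
```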
The performance of human activity classification depends entirely on the m-D features presented in the input spectrogram images. These features often have low energy levels in the reflected signals from the micromotions of limbs compared to the whole body, making them weak features that can be easily deactivated through conv-units, leading to decreased classification accuracy. To address this issue, we propose using additional skip connections in S-R blocks, inspired by the residual block of ResNet [19]. Instead of applying the traditional residual module, the S-R structure of the S-R block facilitates relearning the former low-energy m-D features twice, thereby improving classification accuracy. In addition, this connection structure helps overcome the problem of vanishing gradients in the network. The output of the S-R block is described as follows:
$$Z_{add2} = Z_{add1} \oplus (I/2),$$
with Z_add1 computed as above, where I/2 is the input of the skip connection and Z_add2 is the result obtained through an addition operation combining the skip connection with the forward propagation. As shown in Fig. 5(b), a max-pooling (maxpool) layer with a pool size of 3 × 3 and a stride of (2, 2) is placed at the beginning of each S-R block to down-sample the feature maps for the subsequent blocks.
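A sketch of the complete S-R block consistent with this description and Fig. 5(b) is shown below; the exact tap point of the skip connection is our reading of the figure.

```python
class SRBlock(nn.Module):
    """Maxpool, then an im-res block whose input is reused via a skip
    connection: Z_add2 = Z_add1 (+) I/2."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.im_res = ImResBlock(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(x)           # down-sample the feature maps
        return self.im_res(x) + x  # reuse former features via the skip path
```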
Finally, an output block, which contains a maxpool layer, a conv-unit, an average pooling (avgpool) layer, a fully connected (FCN) layer, a softmax layer, and a classification (class) layer, is used to classify the action classes. The maxpool and avgpool layers are used consecutively to extract the largest and average features. The FCN layer has a number of neurons equal to the number of activities in each given dataset: for example, 11 action classes for the noise-added dataset and six action classes for the real dataset. The last two layers are the softmax and classification layers, where the softmax function generates decimal probabilities for all classes, which are used to predict the appropriate class. The output of the softmax function can be presented as follows:
$$p_i(x) = \frac{e^{r_i(x)}}{\sum_{j=1}^{N} e^{r_j(x)}},$$
where r_i(x) is the i-th element of r(x), the output feature vector of the FCN layer, and N is the number of action classes. Finally, SRCNN predicts the specific action of an incoming signal x based on the class with the highest probability.
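Putting the blocks together, the overall network can be sketched as follows; pooling and layer placements are simplified from Fig. 5(a), and the classification layer corresponds to an argmax over the softmax probabilities.

```python
class SRCNN(nn.Module):
    """Sketch of SRCNN: input block, six S-R blocks, and an output block."""
    def __init__(self, num_classes: int = 11, filters: int = 64,
                 num_blocks: int = 6):
        super().__init__()
        self.input_block = conv_unit(1, filters)      # 306x306x1 spectrograms
        self.sr_blocks = nn.Sequential(
            *[SRBlock(filters) for _ in range(num_blocks)])
        self.output_block = nn.Sequential(
            nn.MaxPool2d(3, stride=2, padding=1),
            conv_unit(filters, filters),
            nn.AdaptiveAvgPool2d(1),                   # avgpool
            nn.Flatten(),
            nn.Linear(filters, num_classes),           # FCN layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.output_block(self.sr_blocks(self.input_block(x)))

# Prediction: softmax over the FCN outputs, then take the most probable class
# probs = torch.softmax(SRCNN()(x), dim=1); pred = probs.argmax(dim=1)
```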
IV. Experimental and Comparison Results
In this section, both datasets (simulated and real) are used to assess the activity classification performance of the proposed model. Initially, the noise-added dataset is used to investigate the impact of the number of S-R blocks and channels per convolution layer on the classification accuracy of the proposed model. Subsequently, the proposed model with refined hyperparameters is evaluated on three aspects, namely, classification accuracy, processing time, and complexity, using both datasets, after which it is compared with other CNNs.
The training and testing processes are carried out on a computer configured with an Intel Core i5-12400F 2.5 GHz processor, 32 GB of RAM, and an RTX 3060Ti GPU. For training, a batch size of 16, an initial learning rate of 0.01, and 20 epochs are used with the stochastic gradient descent optimizer. Moreover, five-fold cross-validation is employed on both datasets to evaluate and compare the performance of the proposed model with other models.
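A sketch of this training protocol is given below; `dataset` is a hypothetical PyTorch dataset of spectrogram/label pairs introduced here for illustration.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Subset
from sklearn.model_selection import KFold

criterion = torch.nn.CrossEntropyLoss()
kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # five-fold CV

for fold, (train_idx, _) in enumerate(kfold.split(np.arange(len(dataset)))):
    model = SRCNN(num_classes=11).cuda()                       # GPU assumed
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # initial LR
    loader = DataLoader(Subset(dataset, train_idx),
                        batch_size=16, shuffle=True)
    for epoch in range(20):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images.cuda()), labels.cuda())
            loss.backward()
            optimizer.step()
```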
1. SRCNN Performance Evaluation
1.1 Impact of filter number on the proposed model’s performance
In this subsection, we first fix the model at 5 S-R blocks and vary the number of filters in each convolutional layer to determine the most suitable number for the proposed model. Fig. 6 depicts the results for three metrics (accuracy, prediction time, and the number of learnable parameters) when the number of filters is set to 32, 48, 64, and 96. The model using 64 filters achieves a good balance between accuracy and inference time for the activity classification task. Specifically, its accuracy reaches 95.64%, which is significantly higher than with 32 or 48 filters and only marginally lower than with 96 filters. Although the accuracy with 96 filters is 95.71%, slightly higher than with 64 filters, it comes at the expense of computation time and complexity: the prediction time rises to 8.4 ms and the number of learnable parameters to 2.4 million (M).
A similar trend occurs when the number of S-R blocks is fixed at 4 and 6 (as shown in Table 1). As a result, using 64 filters for each convolutional layer provides the best trade-off for the proposed model.
1.2 Impact of S-R block number on the proposed model’s performance
In this experiment, to assess the effect of the number of S-R blocks on the accuracy of the proposed model, we fix the number of filter channels at 64. Next, the number of S-R blocks is changed incrementally from 3 to 7. The average classification accuracy of SRCNN as a function of SNR with different numbers of S-R blocks is indicated in Fig. 7.
Fig. 7 shows that the correct recognition rate increases with higher SNR levels and more S-R blocks. However, the recognition accuracy degrades slightly when the number of S-R blocks reaches 7. The average accuracy improves significantly with a small number of blocks (about 3.5% from 3 to 6 blocks) and decreases slightly with a large number of blocks (around 0.5% from 6 to 7 blocks). This finding suggests that using a large number of blocks may approach the limit of the network's learning efficiency. Although the complexity of the model is lower when using 4 or 5 S-R blocks, such a model is insufficiently deep, yielding features that are not defined clearly enough to classify each specific human action. For the best trade-off between accuracy and time consumption, we propose using a model with 6 S-R blocks and 64 filter channels to compete with other existing models.
For further evaluation, the confusion matrix of SRCNN at 10 dB is reported in Fig. 8, which shows that all activities have correct recognition rates greater than 96%, with two notable actions, “walk to fall” and “walking,” achieving a classification accuracy of 100%. This result is due to the clearer separation of the spectral features of these two actions compared to the other actions considered in this study.
1.3 Impact of input size on the proposed model’s performance
Higher-resolution images contain more information and features, which help the classifier achieve higher accuracy. However, this improvement comes at the cost of increased computational complexity, processing time, and memory requirements. In this subsection, we explore the impact of different input sizes, specifically 612 × 612 × 1, 306 × 306 × 1, and 153 × 153 × 1, on the performance of the proposed model. The performance of the proposed model in terms of classification accuracy, training time, and prediction time for the different image sizes is summarized in Table 2. It can be observed that the model with an input size of 306 × 306 × 1 significantly improves average classification accuracy compared to the model with an input size of 153 × 153 × 1, while showing only a marginal difference compared to the model with an input size of 612 × 612 × 1.
In terms of training and prediction times, the model with an input size of 306 × 306 × 1 remains efficient. Specifically, under identical training and testing conditions, it trains in 192 minutes and predicts in 6.5 ms, slightly slower than the model with an input size of 153 × 153 × 1 but considerably faster than the model with an input size of 612 × 612 × 1. Therefore, we select an input size of 306 × 306 × 1 for the proposed model.
2. Performance Comparison
This section presents a performance comparison between the proposed SRCNN and seven existing CNN models, namely RepVGG [20], MobileNet-V2 [21], ResNet [19], DINN [13], DIAT-RadHARNet [22], ConvNeXt-T [23], and Xception [24], focusing on their ability to classify human activities on two datasets (the noise-added simulated dataset and the real dataset).
2.1 Overview of existing models
Among the seven models selected for comparison, Xception and MobileNet-V2 are well-known models published in 2017 and 2018, respectively. Known for their high classification accuracy on the ImageNet dataset, these models are widely adopted in computer vision tasks via transfer learning. The next four networks, ConvNeXt-T, RepVGG, DINN, and DIAT-RadHARNet, are state-of-the-art models published between 2020 and 2022. ConvNeXt-T, a variant of the general ConvNeXt model, achieved significant classification results on the ImageNet dataset. RepVGG is a developed version of the VGG network that applies reparameterization technology to improve classification accuracy while significantly reducing processing time. DINN is a customized CNN inspired by dense connections, which enables the reuse of features otherwise lost during forward propagation in training for human activity recognition. DIAT-RadHARNet is a lightweight DCNN model designed to classify six suspicious activities. Finally, ResNet, the winner of the ImageNet and COCO 2015 challenges, is a renowned model regarded as the first to use traditional residual blocks.
2.2 Comparison results on the noise-added dataset
Table 3 presents the comparison results of the proposed model against the other CNN models. DINN emerges as the fastest network, with an average prediction time of just 5.9 ms; however, its average accuracy is the lowest among the compared networks, at approximately 62%. This result stems from DINN's simple structure (only 1.5 M learnable parameters), which leads to less efficient feature extraction from actions and potential confusion with features extracted from the added noise. Despite having the highest number of learnable parameters (≈27.8 M), ConvNeXt-T achieves a classification accuracy of only 75% and has the second-longest processing time (approximately 12 ms), slightly faster than DIAT-RadHARNet (15 ms). ConvNeXt-T adopts a structure similar to ResNet-50 but with specific modifications aimed at maximizing classification performance on the ImageNet dataset, which may not be optimal for the new spectral image datasets featuring m-D signatures with varying noise levels.
With diverse feature extraction capabilities and effective reuse of features from previous layers, our proposed model achieves the highest classification accuracy (over 96%) with the fewest learnable parameters (around 1.2 M) compared to DIAT-RadHARNet, MobileNet-V2, Xception, and ResNet, which have 2 M, 2.2 M, 20.8 M, and 23.5 M learnable parameters, respectively. Furthermore, our model boasts an execution time of 6.5 ms, comparable to DINN and significantly faster than MobileNet-V2 (7.3 ms), ResNet (7.6 ms), Xception (9.8 ms), and DIAT-RadHARNet (15 ms).
The models' average accuracies are compared across various noise levels, as shown in Fig. 9(a). The results indicate that DINN achieves the lowest accuracy, with only 40% at −15 dB and 83% at 10 dB. Notably, Xception exhibits the highest recognition accuracy at −15 dB, owing to its deep structure (170 layers) and long processing time (9.8 ms). However, as the SNR increases from −10 dB to 10 dB, the proposed model consistently outperforms the others in terms of accuracy, attributed to its optimized two-layer S-R block design. Remarkably, at −5 dB, the proposed model achieves a recognition accuracy of more than 95%, significantly surpassing all others.
In addition to average accuracy, the proposed model is also evaluated using three additional important metrics: precision, recall, and F1-score. Precision is defined as the ratio of the number of true positives (TP) to the number of points classified as positive (TP + FP). Recall is defined as the ratio of true positives (TP) to actual positives (TP + FN). The F1-score is the harmonic mean of precision and recall. These metrics are calculated as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$
where TP, FP, TN, and FN stand for true positive, false positive, true negative, and false negative, respectively. The three metrics are presented in Fig. 9(b), 9(c), and 9(d), where the proposed model attains the highest accuracy and stability across varying noise levels.
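In practice, these macro-averaged metrics can be computed from the predicted and true labels, for instance with scikit-learn (`y_true` and `y_pred` are hypothetical label arrays):

```python
from sklearn.metrics import precision_recall_fscore_support

# Macro-average precision, recall, and F1 over all activity classes
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")
```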
To evaluate the accuracy improvement of the proposed SRCNN model that incorporates enhanced S-R blocks compared to other models, we extracted 11 features from 120 samples of four activities directly from the output of the FCN layer and plotted them on a 2D plane. The selected activities included grab, sit, walk to fall, and walk to sit. The distribution of these features is shown in Fig. 10. Intuitively, the clustering of extracted features from each model reflects its corresponding classification accuracy. Models with widely dispersed or overlapping features are more challenging to classify accurately, resulting in decreased accuracy. Conversely, activities with clearly clustered and separate features are easier to classify. Specifically, features extracted by DINN for the actions “walk to fall” and “walk to sit” exhibit a significant overlap of approximately 50% (violet and green points in Fig. 10(a)). This overlap complicates the accurate classification of these actions, resulting in a high false prediction rate and the lowest average accuracy observed in the DINN model.
Furthermore, we assessed the distribution of extracted features from SRCNN and ResNet. Fig. 10(e) shows the extracted features from ResNet, where features of grab, sit, and walk to fall are relatively scattered and overlap with each other, leading to confusion during classification. By contrast, Fig. 10(h) shows features extracted by the proposed SRCNN model, which are clearly clustered and separated without overlap. This distinction explains why the proposed model achieves significantly improved accuracy compared to models using traditional residual blocks.
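The paper does not name the 2D projection used in Fig. 10; as one common choice, the 11-dimensional FCN-layer features could be embedded with t-SNE and scattered per class, as sketched below (`features` and `labels` are hypothetical arrays):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# features: (n_samples, 11) FCN-layer activations; labels: activity indices
emb = TSNE(n_components=2, random_state=0).fit_transform(features)
for cls in np.unique(labels):
    pts = emb[labels == cls]
    plt.scatter(pts[:, 0], pts[:, 1], s=10, label=f"class {cls}")
plt.legend()
plt.show()
```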
2.3 Comparison results on the real dataset
To enhance confidence in our findings, we compared the proposed network with the others on the real dataset. Table 4 presents the numerical results, highlighting SRCNN's superior recognition accuracy. Specifically, the proposed network achieves an accuracy of 96.63%, a precision of 96.87%, a recall of 96.79%, and an F1-score of 96.83%, thereby outperforming the other models considered. Remarkably, the proposed network, featuring a two-layer residual connection structure, improves classification accuracy by 4.28% and reduces the average execution time by 1.1 ms compared to ResNet, which uses traditional residual connections.
Moreover, Table 4 includes classification results for specific activities on the real dataset, demonstrating the proposed model's strong performance, with all six activities classified with over 91% accuracy.
V. Conclusion
In this study, we proposed the SRCNN model with six S-R blocks and 64 filters in each convolution layer for daily human activity detection and identification based on m-D signatures. The proposed model efficiently extracts and reuses features through a two-layer stacked-residual structure. SRCNN exhibits outstanding recognition performance compared to seven other state-of-the-art networks across various SNRs. In addition, our model achieves the highest accuracy, precision, recall, and F1-score on both the simulated and real datasets. Future work will focus on optimizing and verifying the proposed model through experimental measurements. Subsequently, the optimized model will be implemented in a real system for human activity classification applications.
References
Biography
NgocBinh Nguyen, https://orcid.org/0000-0002-7504-0520 received his B.Sc. and M.Sc. degrees in Electronics and Telecommunications from Telecommunication University, Khanh Hoa, Vietnam, and Le Quy Don Technical University, Hanoi, Vietnam, respectively. He is now a Ph.D. candidate at Le Quy Don Technical University, Hanoi, Vietnam. His current research interests include radar signal processing, image processing, and deep learning.
Van-Sang Doan, https://orcid.org/0000-0001-9048-4341 received his M.Sc. and Ph.D. degrees in Electronic Systems and Services from the Faculty of Military Technology, University of Defence in Brno, Czech Republic, in 2013 and 2016, respectively. He was awarded the Honors degree three times by the Faculty of Military Technology at the University of Defence in Brno, in 2011, 2013, and 2016. He served as a postdoctoral research fellow at the ICT Convergence Research Center at Kumoh National Institute of Technology, Republic of Korea, from 2019 to 2020. He is currently working at the Faculty of Communication and Radar, Vietnam Naval Academy, in Nha Trang City, Khanh Hoa Province, Vietnam. His research interests include radar, sonar, and communication systems, signal processing, and deep learning.
MinhNghia Pham, https://orcid.org/0000-0002-0732-0213 was born in 1980. He received his B.Sc. and M.Sc. degrees in Electronics Engineering from the Le Quy Don Technical University, Vietnam, in 2005 and 2008, respectively, and the Ph.D. degree in Information and Communications Engineering from Harbin Institute of Technology, China in 2014. He is currently a lecturer at Le Quy Don Technical University, Vietnam. He currently focuses on polarimetric synthetic aperture radar image processing, radar signal processing, deep learning and signal processing.
VanNhu Le, https://orcid.org/0000-0001-7023-4265 received his M.S. and Ph.D. degrees from the Harbin Institute of Technology in 2012 and 2016, respectively. Afterwards, he was a postdoctoral researcher in the College of Optical Science and Engineering at Zhejiang University, Hangzhou, Zhejiang, China, from 2016 to 2018. He is now a lecturer at Le Quy Don Technical University. His research fields are fluorescence super-resolution microscopy, wavefront coding systems, image processing, optical design, radar signal processing, and deep learning.