Arm Motion Classification Based on Radar–Camera Cross-Learning with a Generative Model

Article information

J. Electromagn. Eng. Sci. 2026;26(2):170-181
Publication date (electronic) : 2026 March 31
doi : https://doi.org/10.26866/jees.2026.2.r.353
1Department of Electronic Engineering, Sogang University, Seoul, Korea
2Department of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea
*Corresponding Author: Youngwook Kim (e-mail: youngkim@sogang.ac.kr)
Received 2025 May 27; Revised 2025 September 4; Accepted 2025 September 29.

Abstract

In this paper, we classify human arm movements using point clouds obtained from a millimeter-wave multiple-input multiple-output (MIMO) radar integrated with radar–camera cross-learning. When a radar receives signals reflected by the entire human body, delineating the specific details of arm movements is challenging. Our approach involves skeletonization of human point clouds measured by a MIMO radar and then processing them using diverse deep learning models. To enhance the skeletonization of the point cloud model, cross-learning between radar and camera systems is implemented. The point clouds obtained from the radar are trained on camera data, since the latter offer a higher resolution. Two training methods are investigated in this study. The first method utilizes a two-dimensional convolutional neural network (2D-CNN) regression model to extract the arm angles of the skeleton model for determining the class of arm motion. The second method employs an autoencoder (AE) along with data augmentation, performed using a stable diffusion model, to enhance the robustness of feature extraction. The feasibility of the proposed feature extraction methods is validated through experimentation based on six distinct arm motions, resulting in arm motion classification accuracies of 95.23% for the 2D-CNN method and 98.1% for the AE-based method. These outcomes underscore the efficacy of the proposed techniques, which show significant promise for application in detailed human motion classification using radar.

I. Introduction

Safety, security, and surveillance concerns have contributed to the expansion of radar into a wide range of applications. In particular, human motion analysis using radar is employed in security systems, sports science, military training, traffic management, and rehabilitation therapy. Radar technology for human motion analysis is especially valuable in harsh weather environments, such as at night or in foggy conditions. Even in poor visibility situations and through-object scenarios, radar systems continue to exhibit human detection capabilities. Their robust performance under adverse weather conditions has significantly enhanced operational safety and reduced the risk of accidents, especially in the case of automotive and industrial radar. In addition, radar is highly effective in situations that prioritize the protection of human privacy since it can capture target information without causing privacy issues.

A number of different domains can be used to represent human motions measured by radar. For instance, spectrogram analysis provides a time-frequency representation of a signal, offering insights into variations in frequency information across time, thereby helping capture the dynamics of human activities [1, 2]. Micro-Doppler signatures contain Doppler information from limb motion, reflecting the features of human activity [3, 4]. A range-Doppler diagram contains range versus Doppler frequency information, allowing for the visualization of position and velocity in a radar scene [5]. In addition, the time-varying characteristics of a range-Doppler diagram offer dynamic features for target classification [6]. Meanwhile, the range-angle domain presents the target locations in space. Furthermore, point clouds derived from multiple-input multiple-output (MIMO) radar returns have been employed to represent the spatial distribution of the scattering points of a target. This three-dimensional (3D) representation of a human subject allows for a more detailed analysis of the environment and enables the identification of specific human activities based on spatial variations within the point clouds [7].

Human activity classification using radar involves feature extraction and boundary construction processes. Recent research in this area has focused on enhancing the accuracy of human data classification by integrating deep learning technologies. For instance, two-dimensional (2D) convolutional neural networks (CNNs) have been employed to identify spatial relationships in spectrograms [2, 8] or range-Doppler diagrams [9]. Meanwhile, 3D-CNNs are particularly effective in capturing spatiotemporal patterns within time-varying radar data. When applied to a temporal dimension in addition to 2D data, the model learns patterns of human movements in both space and time [10]. In addition, recurrent neural networks (RNNs) are well suited for capturing temporal dependencies, making them valuable for tasks in which the sequence of events is critical. They have been used to recognize patterns in time-series data obtained from human subjects [11]. Furthermore, by extracting temporal feature patterns from point cloud data generated by a MIMO radar system, researchers were able to significantly enhance the capability of human activity classification, which was not possible using spectrograms or range-Doppler diagrams [12]. This improvement was achieved using long short-term memory (LSTM), which is a variant of RNNs.

Beyond the broader context of human motion analysis, there has been a notable surge in interest specifically directed toward understanding human arm motion. Applications of arm motion detection include gaming, sign language recognition, medical assessments, and navigation in virtual reality environments [13]. This growing interest has materialized in studies leveraging depth cameras and CNNs for the precise identification and classification of distinct arm motions. In this context, an alternative approach that combines point cloud data obtained from millimeter-wave radar with a lightweight PointNet-based classifier for effective arm motion analysis has been proposed [14]. While cameras excel in providing high-resolution visual insights, their capabilities are limited in dark or low visibility environments, and they may pose challenges related to privacy. In contrast, radar technology exhibits superior capabilities in security and surveillance, given that it is not affected by the environment, but it is characterized by poor spatial resolution. Recognizing these complementary strengths, the use of both camera and radar technologies can be considered a strategic integration poised to unlock enhanced human body motion classification performance [15].

In this paper, we propose an arm motion classification method based on cross-learning between radar data and camera data using radar point clouds. Human point clouds extracted from a millimeter-wave MIMO radar are employed to construct a skeletal structure that provides essential information for accurate arm motion classification, indicating an advancement in data processing techniques. To the best of our knowledge, human point cloud data have yet to be studied using cross-learning. We use deep learning models to devise a process that transforms point clouds into a skeletal structure that can help analyze arm motions, thereby elevating the accuracy of classification. In this process, cameras are used to guide radar data in the reconstruction of human skeletons through cross-learning.

Furthermore, we employ two distinct approaches for arm motion classification. The first approach introduces a 2D-CNN regression model to estimate arm angles from point cloud data. The model is trained using point clouds, whose true arm angles are estimated from the skeleton captured by the camera. A human skeleton is extracted from the camera image using the OpenPose algorithm [16]. Subsequently, the 2D-CNN regression model is used to generate a refined skeleton using the estimated arm angles. These data are then fed into a 3D-CNN designed for arm motion classification based on time-varying features. Meanwhile, the second approach integrates an autoencoder (AE) to convert the point cloud data into a more accurate skeletal representation. The AE is trained on the data consisting of point clouds obtained from the radar as inputs and the skeleton extracted from the camera as outputs. Furthermore, to overcome data deficiency issues, we employ a stable diffusion model [17] for data augmentation. Following the generation of the skeleton by the AE, a 3D-CNN is employed for the final arm motion classification. Overall, by synergistically incorporating radar and camera data, the proposed models exemplify a cross-learning approach to arm motion classification, demonstrating the potential for significant advancement in this domain.

This paper is organized as follows. In Section II, we describe the radar setup and data acquisition process. In Section III, we explain the preprocessing applied to the raw data obtained from the radar. In Section IV, we classify the 3D point cloud data received after preprocessing using 3D-CNN without considering any sophisticated process as a reference. In Section V, we skeletonize the point cloud data using a 2D-CNN regression model for angle extraction, and then classify the arm motion using 3D-CNN. In Section VI, we skeletonize the point cloud data using AE with the aid of data augmentation, followed by arm motion classification using 3D-CNN. Finally, in Section VII, we discuss both processes mentioned above.

II. Target Measurements using Radar

To obtain the necessary data for training, we captured point clouds using the RETINA-4SN radar developed by Smart Radar Systems Ltd. It is a frequency-modulated continuous wave (FMCW) radar operating in the millimeter-wave range. The system employs a MIMO antenna array along with four cascaded FMCW radar chips (Texas Instruments) to create 192 virtual receiving channels. Notably, the 2D antenna arrays are non-uniformly distributed, enhancing angular resolution. Operating within the frequency range of 77–81 GHz, RETINA-4SN provides a radar detection range of 7 m. In this study, the azimuth field of view (FOV) and elevation FOV were both set at 90 degrees, with the update rate configured at 50 ms.

First, the signals received by the 192 channels were used to construct their respective range-Doppler diagrams. Once the targets were identified in the range-Doppler diagram, each point scatterer was localized by estimating the range and direction of arrival in the azimuth and elevation angles. Thereafter, by determining the 3D location of each target detected in the range-Doppler domain, 3D point clouds were constructed. Effectively, when a human subject was measured using the RETINA-4SN radar, the human point clouds resembled an actual human shape. The signal processing procedures for obtaining a point cloud plot are shown in Fig. 1.
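The last step of this chain, turning a detection's range and direction of arrival into a 3D point, can be sketched as follows. This is only an illustrative sketch: the function name `to_cartesian` and the axis convention (y along the radar boresight, x sideways, z upward, azimuth measured from boresight) are assumptions and are not taken from the radar's documentation.

```python
import math

def to_cartesian(range_m, azimuth_deg, elevation_deg):
    """Convert one detection (range, azimuth, elevation) to (x, y, z) in metres.

    Assumed convention: y points away from the radar (boresight), x points
    sideways, z points upward; azimuth is measured from boresight.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = range_m * math.cos(el) * math.sin(az)
    y = range_m * math.cos(el) * math.cos(az)
    z = range_m * math.sin(el)
    return x, y, z

# A scatterer straight ahead at 1.5 m lies on the y-axis.
point = to_cartesian(1.5, 0.0, 0.0)
```

Applying such a conversion to every detection in a frame would yield the (x, y, z) part of that frame's point cloud.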

Fig. 1

Human point-cloud construction using a MIMO FMCW radar.

To construct the training dataset for sensor cross-learning, data were collected using both a radar and a camera. The two sensors operated simultaneously, so that the point cloud data from the radar and the skeleton data from the camera were acquired at the same time.

The camera images provided a true skeleton of the human body, which was then used to train the point clouds acquired from the radar. Fig. 2 illustrates the experimental setup and environment. Fig. 2(a) shows that the length of the x-axis is 2.3 m, while the y-axis ranges from 1 m to 3.3 m. Furthermore, as shown in Fig. 2(b), the z-axis is within 0.93 m for the lower part and within 1.4 m for the upper part. The reason for maintaining a distance of 0.93 m from the ground was to ensure that the radar could measure both the upper and lower bodies of the participants. To efficiently extract point clouds from the human subject, the participants performed a specific movement at a distance of 1.5 m from the radar. The subjects’ movements were simultaneously captured by a camera recording a video at the same location. The recorded video was processed using the OpenPose algorithm to extract human skeletons from the captured images.

Fig. 2

Experimental environment: (a) horizontal, and (b) vertical.

We aimed to classify postures and motions resembling aircraft marshaling signals, which feature simple patterns. As shown in Fig. 3, we investigated six classes of postures and motions, four dynamic and two static. They are as follows. In Action [A], the person is standing still with their right arm bent at a 90° angle with the elbow pointing out to the side and the forearm positioned vertically. In Motion [B], the person is repeatedly bending their left elbow toward their head and then extending the arm straight out to the left side.

Fig. 3

The six arm gestures: (a) Action [A]: Standing with right arm bent up at the side, (b) Motion [B]: Bending and straightening the left arm sideways, (c) Motion [C]: Bending and straightening both arms sideways, (d) Action [D]: Standing still with right arm at the side, (e) Motion [E]: Bending and straightening the right arm sideways, and (f) Motion [F]: Both arms up and down at the sides.

In Motion [C], the person is repeatedly bending both their elbows toward their head and then extending the arms straight out to each side. In Action [D], the person is standing still with their right arm extended straight to their side. In Motion [E], the person is repeatedly bending their right elbow toward their head and then extending the arm straight out to the right side. Finally, in Motion [F], the person is continuously raising and lowering both arms at the sides.

Five participants performed each movement for 500 frames. Effectively, we obtained 500 point-cloud plots for each class per participant from the measurement results. Consequently, it was possible to acquire 3,000 data points per participant, resulting in a dataset size of 15,000. We aimed to conduct training using this measured dataset. Notably, measurements were conducted after approval from the Institutional Review Board at Sogang University (IRB Approval No. SGUIRB-A-2308-33).

III. Data Preprocessing

The raw data consisted of points in the form of (x, y, z, d, p), with each dimension representing a distinct physical parameter. In particular, x denotes the distance along the x-axis, y refers to the distance along the y-axis, z is the distance along the z-axis, d indicates the Doppler shift (velocity), and p represents the power of the reflected signal. This dataset offered comprehensive information about each point in space, enabling us to detect changes in an individual’s arm motions within a reference space. Furthermore, for learning spatial changes, PointNet can be used for training using point cloud data. However, since we wanted to examine the temporal dynamics of arm motions, we opted for a 3D-CNN instead. This allowed us to visualize the relative positions and magnitudes in the space, presenting the raw data points within the reference space across time.

Initially, our focus was on arm motion classification. Notably, due to the characteristics of arm motion, we conducted measurements by fixing the distance along the y-axis, with the aim of emphasizing the information related to the x-axis and z-axis. In other words, the information suitable for our objectives corresponded to the x position, z position, and power, as illustrated in Fig. 4, which presents the corresponding point clouds of the human posture in Fig. 2(a).

Fig. 4

2D point clouds for one frame of Motion [C].

We aimed to construct 3D data by accumulating the 2D motion images for each frame and stacking them over the entire measurement time. Our objective was to classify the motion over 20 frames. Given that the frame period of the radar synchronized with the camera was 100 ms, we obtained data for a total motion duration of 2 seconds.
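A minimal sketch of this accumulation step is given below, assuming NumPy. The helper names `rasterize` and `stack_windows`, the axis limits, the rule of keeping the strongest power per cell, and the window stride of one frame are all assumptions made for illustration. The 64×64 grid follows the image size mentioned later in the text, and power is assumed to be non-negative.

```python
import numpy as np

GRID = 64     # image side length (64x64 images are mentioned later in the text)
WINDOW = 20   # frames per sample, i.e. 2 s at a 100 ms frame period

def rasterize(points, x_lim=(-1.15, 1.15), z_lim=(-0.93, 1.4)):
    """Project the (x, z, power) points of one frame onto a GRID x GRID image.

    The axis limits are assumed values; points outside them are dropped.
    When several points share a cell, the strongest power is kept.
    """
    img = np.zeros((GRID, GRID), dtype=np.float32)
    for x, z, p in points:
        if not (x_lim[0] <= x < x_lim[1] and z_lim[0] <= z < z_lim[1]):
            continue
        col = int((x - x_lim[0]) / (x_lim[1] - x_lim[0]) * GRID)
        row = int((z_lim[1] - z) / (z_lim[1] - z_lim[0]) * GRID)  # larger z is nearer the top
        row, col = min(row, GRID - 1), min(col, GRID - 1)
        img[row, col] = max(img[row, col], p)
    return img

def stack_windows(frames, window=WINDOW, stride=1):
    """Stack consecutive frame images into volumes of shape (window, GRID, GRID)."""
    images = [rasterize(f) for f in frames]
    starts = range(0, len(images) - window + 1, stride)
    return np.stack([np.stack(images[s:s + window]) for s in starts])

# 25 frames with one point each give 6 overlapping 20-frame volumes.
frames = [[(0.0, 0.0, 1.0)] for _ in range(25)]
volumes = stack_windows(frames)
```

Each resulting volume has two spatial axes and one time axis, which is the input layout a 3D-CNN expects.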

Fig. 5 presents examples of the data extracted while a participant performs Action [A]. Fig. 5(a) shows the participant, and the point clouds in Fig. 5(b) were obtained through radar signal processing. Fig. 5(c) displays the result of applying the OpenPose algorithm to the image captured by the camera to obtain the participant’s skeleton. Finally, Fig. 5(d) presents the skeleton extracted from the camera image.

Fig. 5

(a) Participant, (b) point cloud, (c) camera image along with the skeleton, and (d) skeleton extracted from (c).

IV. Classification of Point Clouds using 3D-CNN

First, as a reference, we applied a 3D-CNN to the original point clouds constructed using the procedure depicted in Fig. 6, with the aim of comparing the results of this basic training method for point clouds [18] with those of the proposed approaches in Sections V and VI. Notably, a 3D-CNN is an extension of a conventional 2D-CNN that processes three-dimensional data by employing 3D filters. By applying a 3D-CNN to 2D data across time, the time-varying features of the 2D data can be captured. In this study, the data have three dimensions: the two spatial axes of the x-z plane and time. A schematic diagram of the 3D-CNN adopted in this study for training is provided in Fig. 7, and its layer structure is explained in Table 1.
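To make the idea of a 3D filter concrete, the following sketch slides a single filter over a (time, z, x) volume, assuming NumPy. It is a naive single-channel cross-correlation without padding or stride, written for illustration only; it does not reproduce the layer structure used in this study, and the function name `conv3d_valid` is hypothetical.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive single-channel 3D cross-correlation without padding.

    volume: array of shape (time, height, width)
    kernel: array of shape (kt, kh, kw)
    """
    t, h, w = volume.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1), dtype=np.float64)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                patch = volume[i:i + kt, j:j + kh, k:k + kw]
                out[i, j, k] = np.sum(patch * kernel)
    return out

# A 3x3x3 filter spans three consecutive frames as well as a 3x3 spatial patch.
feature_map = conv3d_valid(np.ones((20, 8, 8)), np.ones((3, 3, 3)))
```

Because the filter extends along the time axis, each output value depends on how the 2D image changes over neighboring frames, which a 2D filter cannot capture.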

Fig. 6

Composition of the 3D data constructed by accumulating 2D point clouds across time.

Fig. 7

The 3D-CNN process using point clouds.

Structure of the 3D-CNN classification model for point clouds

During the training of the 3D-CNN model, we monitored the test error to prevent overfitting. The training data were collected from four individuals (Humans 1, 2, 3, and 4), while the test data comprised data from Human 5 alone. This approach was adopted to prevent the model from fitting specifically to a particular individual’s data. Overall, the training dataset comprised 11,520 data samples, and the test dataset consisted of 2,880 data samples. To investigate the performance of the 3D-CNN fairly, we examined its test accuracy by varying the number of layers and filter sizes. Fig. 8 presents the test accuracy corresponding to these parameter variations. Furthermore, Table 2 presents the confusion matrix for validation. Notably, a classification accuracy of 89.97% was achieved. This accuracy will later be compared with that obtained using the proposed methods.
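The subject-wise split described above can be sketched as follows. The tuple layout `(subject_id, data, label)` and the function name `split_by_subject` are hypothetical; the sample counts in the usage lines simply mirror the 11,520/2,880 split reported in the text, which corresponds to 2,880 samples per participant.

```python
def split_by_subject(samples, test_subject):
    """Hold out every sample of one participant for testing.

    samples: list of (subject_id, data, label) tuples (hypothetical layout).
    """
    train = [s for s in samples if s[0] != test_subject]
    test = [s for s in samples if s[0] == test_subject]
    return train, test

# Placeholder samples: 5 participants with 2,880 samples each.
samples = [(subject, None, 0) for subject in range(1, 6) for _ in range(2880)]
train_set, test_set = split_by_subject(samples, test_subject=5)
```

Splitting by participant rather than by random sample keeps all of Human 5's data unseen during training, so the test accuracy reflects generalization to a new person.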

Fig. 8

Evaluation of test accuracy based on varying parameters for 3D-CNN.

Confusion matrix using point clouds with 1-fold validation

V. Classification of Point Clouds using Angle Estimation

Given that aircraft marshaling signals can be classified based on the positions and movements of both arms, we reasoned that learning from the entire set of original point clouds may not be the most effective approach, since they include unnecessary information that can be misleading. Therefore, we implemented a method to extract only the angles formed at the top of the arms and at the armpits from the original point clouds to reconstruct clearer images based on relative angles. We expected that training the 3D-CNN with cleaner images would lead to higher classification accuracy.

1. Arm Angle Estimation using 2D-CNN

The four angles presented in Fig. 9 serve as features for classifying arm motions. A1 refers to the angle between the right lower arm and the right upper arm, A2 denotes the angle between the right upper arm and the right side, A3 signifies the angle between the left upper arm and the left side, and A4 indicates the angle between the left upper arm and the left lower arm. Given the nature of arm motion, we expected the variation trend of the four angles to be relatively consistent for each motion. For instance, Action [A] was expected to maintain relatively constant values for all four angles throughout the measurement period, while Motion [E] was likely to show variations in A1 and A2 but maintain constant values for A3 and A4. We chose to leverage these characteristics for arm motion classification.

Fig. 9

The four angles, along with labels, considered for the 2D-CNN regression model.

To estimate the four angles from the given point clouds, a 2D-CNN regression model was employed. The training data for this model were constructed by extracting true angle features from the skeleton map of the images captured by the camera. The camera data gathered during data acquisition were processed by implementing the OpenPose algorithm developed by Carnegie Mellon University in 2017. Notably, this algorithm utilizes a deep CNN to learn human features from camera images and to accurately estimate body skeletons. Moreover, it can generate body skeletons in cases where the entire shape of the body is not completely captured by the camera. After obtaining the coordinates of the skeleton joints corresponding to the hand, elbow, and shoulder, we calculated the angles. Using the 2D point clouds as input data, we trained the 2D-CNN regression model to extract data pertaining to the four angles from the skeleton map. A schematic diagram of the 2D-CNN model for angle extraction is depicted in Fig. 10, and the specifics of its structure are outlined in Table 3.
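The angle calculation from joint coordinates can be sketched as below. This is an illustrative sketch rather than the exact procedure used here: the function name `joint_angle` is hypothetical, the joints are assumed to be 2D image coordinates, and using a hip keypoint as the reference for the side of the body (for A2 and A3) is an assumption.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees (0 to 180) at joint b between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return math.degrees(math.atan2(abs(cross), dot))

# Hypothetical 2D joint coordinates for a right arm.
shoulder, elbow, hand, hip = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, -2.0)
a1 = joint_angle(shoulder, elbow, hand)  # lower arm vs. upper arm
a2 = joint_angle(elbow, shoulder, hip)   # upper arm vs. side of the body (assumed reference)
```

Using `atan2` of the cross and dot products avoids the numerical problems that `acos` has for nearly straight or nearly folded arms.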

Fig. 10

Procedure for estimating the four angles using the 2D-CNN regression model.

Structure of the 2D-CNN regression model

To ensure the validity of the training process and evaluate training performance, data from multiple individuals were employed to construct the training and test datasets for the 2D-CNN regression model. The training dataset was composed of 11,520 data points, and the test dataset consisted of 2,880 data points. The root mean square error (RMSE) of the regression model was calculated to be 25.33°. Although this error is not ideal, the arm movements were classified based on the temporal movement of the arms rather than the angle itself, resulting in better accuracy.
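For reference, an RMSE of this kind can be computed as in the sketch below. Whether the reported value pools all four angles or averages per-angle errors is not stated, so pooling over all samples and angles is an assumption, as is the function name `rmse_deg`.

```python
import math

def rmse_deg(predicted, actual):
    """RMSE in degrees, pooled over all samples and all four angles.

    predicted, actual: lists of [A1, A2, A3, A4] angle lists in degrees.
    """
    squared = [(p - t) ** 2
               for ps, ts in zip(predicted, actual)
               for p, t in zip(ps, ts)]
    return math.sqrt(sum(squared) / len(squared))

# One sample with errors of 3 and 4 degrees on two of the four angles.
error = rmse_deg([[0.0, 0.0, 0.0, 0.0]], [[3.0, 4.0, 0.0, 0.0]])
```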

Fig. 11(a) illustrates the original point clouds used as input for testing the performance of the regression model, while Fig. 11(b) shows the skeleton map reconstructed from the four angles estimated by the regression model. Notably, when generating a skeleton based on these angles, the segments of the body other than both hands and elbows (the head, shoulders, and legs) were fixed at specific positions in the image. Subsequently, based on the lengths of both upper arms and forearms along with angles A1, A2, A3, and A4, the relative positions of the hand and elbow joints were calculated to reconstruct the skeleton image.
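The joint placement described above amounts to simple forward kinematics, sketched below under several assumptions that are not specified in the text: the shoulder angle (A2 or A3) is measured from the downward direction of the torso, the elbow angle (A1 or A4) is the interior angle between the upper arm and the forearm, the forearm bends upward, and the segment lengths are placeholder values. The function name `arm_joints` is hypothetical.

```python
import math

def arm_joints(shoulder, upper_len, fore_len,
               shoulder_angle_deg, elbow_angle_deg, side=1):
    """Return (elbow, hand) positions in the x-z image plane for one arm.

    Assumed conventions: x grows toward the arm's side (side = +1 or -1),
    z grows upward, a shoulder angle of 0 degrees means the upper arm hangs
    straight down, and an elbow angle of 180 degrees means a straight arm.
    """
    theta = math.radians(shoulder_angle_deg)
    elbow = (shoulder[0] + side * upper_len * math.sin(theta),
             shoulder[1] - upper_len * math.cos(theta))
    # The forearm continues along the upper arm, rotated by the elbow bend.
    phi = theta + math.radians(180.0 - elbow_angle_deg)
    hand = (elbow[0] + side * fore_len * math.sin(phi),
            elbow[1] - fore_len * math.cos(phi))
    return elbow, hand

# Upper arm horizontal, forearm vertical (similar to the posture of Action [A]).
elbow, hand = arm_joints((0.0, 0.0), 0.30, 0.25, 90.0, 90.0)
```

With the fixed body segments drawn first, the elbow and hand positions returned for each arm would be enough to draw the remaining limb segments of the skeleton image.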

Fig. 11

(a) Point clouds used as input to the regression model for testing Motion [F]; (b) Skeleton reconstruction obtained from the output of the 2D-CNN regression model, depicting the four angles.

While directly inspecting the point clouds in Fig. 11(a) may be challenging in terms of precisely discerning the nature of the motion or identifying changes in specific body parts, the reconstructed skeleton in Fig. 11(b) is visually clearer. Therefore, we determined that utilizing such visually explicit data could enhance classification performance.

2. Classification of Angle-based Images using 3D-CNN

The angle features extracted by the regression model were used to reconstruct the skeleton map, which was then employed for classification. Although the classification could have been conducted using only the angle values, our objective was to reconstruct images using angle information to ascertain the optimal features when using the same classifier, 3D-CNN. Similar to the 3D point clouds depicted in Fig. 7, we aimed to construct a 3D skeleton map across time and train it using a 3D-CNN classification model. A schematic diagram of the 3D-CNN classification model is depicted in Fig. 12. Its structure is the same as that noted in Table 1.

Fig. 12

Classification process using 3D-CNN based on the reconstructed human model.

To ensure the validity of the training process and to evaluate training performance, data acquired from different individuals were employed to construct the training and test datasets, following which 3D-CNN training was conducted. The training dataset comprised 11,520 data points, and the test dataset consisted of 2,880 data points. A test accuracy of 94.38% was achieved, showcasing improved performance compared to the 89.97% obtained using the original point clouds. Table 4 presents the confusion matrix for validation.

Confusion matrix using 2D-CNN-based skeleton with 1-fold validation

VI. Classification with Autoencoder Enhanced by a Generation Model

1. Data Augmentation through Stable Diffusion Model

While the RETINA-4SN 4D imaging radar can capture non-stationary motions effectively, it may fail to capture slow motions or stationary objects, since zero Doppler signals are generally suppressed in the clutter rejection process. Point clouds that are not captured owing to clutter rejection or imperfect measurements resulting from noise can act as a hindrance to deep learning model training. Therefore, carefully selecting well-measured data is integral to training deep learning models. However, such a dataset may be limited in quantity or time-consuming to obtain. Hence, it is necessary to augment the data.

Data augmentation is a machine learning technique that is used to increase the diversity of the training dataset by applying transformations to existing data. Among the various tools available for data augmentation, a generative adversarial network (GAN) is one that is widely used [19]. However, a major limitation of GAN is the mode collapse issue, which refers to a scenario in which either the generator or the discriminator becomes overly dominant in learning compared to the other, thereby hindering the progress of the other’s learning. In GAN, such instability in training due to mode collapse greatly affects the learning process. This mode collapse issue encountered in GAN is addressed by another data augmentation method—the stable diffusion model [20]. This model converts a user’s input text prompt into an embedding vector using a text encoder, where noise is added during the diffusion process and removed during the reverse diffusion process. The embedding vector determines the direction of the noise control process, helping to generate images based on the given text prompt that are clearer and of a higher quality than GAN. As a result, the stable diffusion model is considered more suitable for high-quality data augmentation.

For this study, we employed the Stable Diffusion Web UI provided by AUTOMATIC1111. As evident from Fig. 3, Actions [A] and [D] are static, remaining the same across all frames, whereas Motions [B], [C], [E], and [F] vary in each frame. Considering this aspect, for 2D data augmentation, it was crucial to decompose each motion into multiple static motions per frame, forming a series of postures. However, due to the limited amount of data and the presence of measurement errors or noise in some frames, embedding every frame individually was not feasible. Therefore, meaningful frames pertaining to the six movements were selected, converted into 2D image-based posture data, and then used for training. Ultimately, a total of 50 out of the actual 100 images per action were embedded in the model. Notably, the stable diffusion model operates optimally with [512×512]-sized RGB images. Hence, the point cloud data presented in Fig. 4 were adjusted to match the color map in Fig. 13(a), and the original [64×64] point cloud data were resized to [512×512] images. Tokens were assigned for each action, and embedding was conducted. The embedding learning rate was set at 0.005, batch size at 1, and maximum steps at 50,000. Furthermore, among the 50 images from the actual data that were not employed for embedding in the stable diffusion model, we selected 10 for image-to-image data augmentation using positive prompts specific to each action’s token.

Fig. 13

(a) Point cloud image, and (b) augmented point cloud image in a frame.

In a stable diffusion model, text prompts provided by the user are divided into positive and negative prompts. Positive prompts are used to specify the characteristics and content of the image to be generated, while negative prompts are used to indicate the elements of the features that should not be included. In this study, we used only positive prompts. Notably, data augmentation was conducted correctly only when the actual data and positive prompts were applied simultaneously. During data augmentation, the image size was maintained at [512×512], consistent with the original images, with sampling steps set at 150, denoising strength at 0.5, and a batch count of 100. The batch count ranged from 1 to a maximum of 100, representing the data augmentation ratio. This implies that for a batch count of 100, 1,000 images were generated from 10 actual data images using the stable diffusion method.

Ultimately, the complete training dataset comprised images generated using stable diffusion, as well as the actual data that had not been subjected to the stable diffusion embedding process. Fig. 13(b) presents an example of the augmented data obtained upon applying the stable diffusion model to Fig. 13(a). Utilizing data pertaining to Human H1, we embedded a total of 14 actions for data augmentation. Effectively, the training data comprised both Human H1’s actual data and the data generated through data augmentation from the stable diffusion model. Overall, the entire dataset consisted of 14,325 images resized to [64×64].

2. Skeleton Image Generation using AE

We applied AE to the point cloud data to learn the features and patterns of the skeletons corresponding to the different movements, ultimately obtaining clear skeleton images. Notably, AE is an unsupervised learning model that is widely utilized in deep learning. It is usually employed to learn efficient representations of input data and then reconstruct them. It compresses data into a low-dimensional latent representation and endeavors to reconstruct it as closely as possible to the original data. To achieve this, the encoder and decoder are symmetrically structured. For this study, we employed a 2D-CNN-based AE, which uses 2D-CNNs to detect spatial features in image data. We used 2D-CNN layers in both the encoder and decoder sections to accurately capture the spatial characteristics of the data.

The AE was structured to consider the point cloud data as input and reconstruct skeleton data as output. In other words, the AE was trained to reconstruct skeleton images from point cloud data. The data augmented by the stable diffusion model were used as input data, while the skeleton images generated for each of the 14 categorized movements served as the output data. The use of AE also aided in removing the noise originating from the point cloud data.

Fig. 13(a) shows that as the signal strength increases, the color tends toward the red spectrum, and as the signal weakens, it leans toward the blue spectrum. To normalize the signal strength in the point cloud data from 0 to 1, the hue, saturation, and value (HSV) color model was applied. When extracting hue data from the HSV model, higher array values denote areas closer to the red spectrum, while lower values signify proximity to the blue spectrum. Initially, the [512×512×3] point cloud data were transformed into [512×512×1], and then resized to [64×64×1] to match the AE’s input size. Fig. 14(a) presents the obtained 2D point cloud data, ranging from 0 to 1. Fig. 14(b) illustrates the AE’s output, depicting the skeleton data, also ranging from 0 to 1. Notably, the entire dataset comprised 14,325 pairs of images, with 90% utilized as training data and 10% reserved for the validation set.
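One possible way to implement this conversion is sketched below, assuming NumPy and the standard-library `colorsys` module. Several choices here are assumptions rather than details from the text: in `colorsys`, red has a hue of 0 and blue a hue of 2/3, so the sketch inverts and rescales the hue so that 1 corresponds to red; colorless pixels (zero saturation or value) are treated as background and set to 0; and the resize is done by block averaging. The function names `hue_intensity` and `downsample` are hypothetical.

```python
import colorsys
import numpy as np

BLUE_HUE = 2.0 / 3.0  # hue of pure blue in colorsys (red is 0)

def hue_intensity(rgb):
    """Map an (H, W, 3) uint8 RGB image to (H, W) values in [0, 1].

    Red-like pixels map to values near 1 and blue-like pixels to values
    near 0; colorless pixels are treated as background (0).
    """
    scaled = np.asarray(rgb, dtype=np.float64) / 255.0
    out = np.zeros(scaled.shape[:2], dtype=np.float32)
    for i, j in np.ndindex(*out.shape):
        hue, sat, val = colorsys.rgb_to_hsv(*scaled[i, j])
        if sat > 0 and val > 0:
            out[i, j] = 1.0 - min(hue, BLUE_HUE) / BLUE_HUE
    return out

def downsample(img, factor):
    """Shrink a 2D image by block averaging (sides must be divisible by factor)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

# A 512x512 RGB image would be reduced to 64x64 with factor=8;
# a small all-red image is used here to keep the example fast.
red = np.zeros((16, 16, 3), dtype=np.uint8)
red[..., 0] = 255
small = downsample(hue_intensity(red), 8)
```

The per-pixel loop is slow for full-size images and is kept only for readability; a vectorized color conversion would be preferable in practice.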

Fig. 14

Examples of input and output images for training: (a) input and (b) output.

The architecture of the final AE is depicted in Fig. 15. More details on its structure are listed in Table 5. Fig. 15 clarifies that, in the encoder, the input data sequentially passes through convolutional layers, ReLU layers, and max-pooling layers. Meanwhile, the decoder consists of transposed convolutional layers and ReLU layers. Notably, the decoder has the same number of filters as the encoder. At the end of the decoder, a convolutional layer with a single filter is connected to a regression layer to generate the output images. To train the AE, the optimizer was set to Adam, and the data were shuffled at each epoch, with the maximum epoch set to 60. Ultimately, the validation RMSE for the trained model reached 1.9595.

Fig. 15

Structure of 2D-CNN-based AE.

Table 5

Structure of the 2D-CNN-based AE model

Fig. 16 presents the results of the autoencoder model’s performance in generating skeleton images. Fig. 16(a) and 16(b) show the point cloud data used as input for the autoencoder. More specifically, Fig. 16(a) represents a segment of the motion with both arms raised [F], while Fig. 16(b) signifies a new untrained motion in which the left arm is held out horizontally. The outputs from the autoencoder model are shown in Fig. 16(c) and 16(d). When compared to the ground truth skeleton images in Fig. 16(e) and 16(f), it is clear that the autoencoder successfully generated distinct skeleton images corresponding to arm motions. This indicates that the autoencoder was able to effectively learn the features of the point cloud data to successfully reconstruct skeletons for arm motions.

Fig. 16

(a) Input point clouds included in training, (b) input point clouds not included in training for test, (c) output of the CNN-based autoencoder for (a), (d) output of the CNN-based autoencoder for (b), (e) true skeleton image obtained from photos of (a), and (f) true skeleton image obtained from photos of (b).

3. Classification using 3D-CNN

We applied the AE to the data corresponding to each frame of the 3D point clouds, transforming them into 2D skeleton images. As in the case of the 3D point clouds, we aggregated the skeleton images obtained from the AE and combined them into groups of 20 frames to reconstruct a 3D skeleton map. This reconstructed 3D skeleton map was then used for classification. The structure of this 3D-CNN classification model is provided in Table 6.
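The grouping of per-frame skeleton images into 20-frame volumes can be sketched as follows. Whether consecutive volumes overlap is not stated, so the hop between volumes is exposed as a `step` parameter, and the non-overlapping default is an assumption. The function name is hypothetical, and the frames are stored in time-first order (20×64×64) rather than the 64×64×20 layout listed for the network input.

```python
def stack_frames(frames, depth=20, step=20):
    """Group per-frame 2D images into volumes of `depth` consecutive frames.

    `step` is the hop between the starts of consecutive volumes. Setting
    step == depth gives non-overlapping volumes (an assumption, since the
    overlap is not stated). Trailing frames that cannot fill a complete
    volume are dropped.
    """
    if depth <= 0 or step <= 0:
        raise ValueError("depth and step must be positive")
    volumes = []
    for start in range(0, len(frames) - depth + 1, step):
        volumes.append(frames[start:start + depth])
    return volumes


# 45 dummy 64 x 64 skeleton frames filled with zeros.
frames = [[[0.0] * 64 for _ in range(64)] for _ in range(45)]
non_overlapping = stack_frames(frames)          # volumes start at frames 0 and 20
sliding = stack_frames(frames, step=5)          # volumes start at 0, 5, ..., 25
print(len(non_overlapping), len(sliding))
```

A smaller `step` produces more, partially overlapping training volumes from the same recording, at the cost of correlated samples.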

Table 6

Structure of the 3D-CNN classification model

As in the 3D-CNN classification based on 3D point clouds, the training dataset consisted of 11,520 data points and the test dataset consisted of 2,880 data points. Table 7 presents the confusion matrix for validation.

Table 7

Confusion matrix using the autoencoder-based skeleton with 1-fold validation

Fig. 17 compares the final test accuracies of the method using the original point clouds with those of the two proposed approaches: the combination of 2D-CNN regression with 3D-CNN and the combination of the 2D-CNN autoencoder with 3D-CNN. Both proposed methods improve on the baseline, with the 2D-CNN autoencoder combined with 3D-CNN performing best, achieving a test accuracy of 98.85% compared to the 89.97% attained by the baseline model.
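The accuracies quoted above can be cross-checked against the confusion matrices in Tables 2, 4, and 7. The sketch below takes accuracy as the number of correctly classified samples (the diagonal entries) divided by the 2,880 test samples stated in the text. The matrix values are copied from the tables, and the function names are hypothetical.

```python
CLASSES = ["A", "B", "C", "D", "E", "F"]

# Rows are true classes, columns are estimated classes (values from Table 2).
POINT_CLOUD_CM = [
    [463, 0, 15, 0, 0, 2],
    [0, 480, 0, 0, 0, 0],
    [0, 0, 480, 0, 0, 0],
    [0, 57, 0, 423, 0, 0],
    [0, 85, 0, 46, 349, 0],
    [0, 2, 78, 4, 0, 396],
]

# Diagonal (correctly classified) counts copied from Tables 4 and 7.
REGRESSION_DIAGONAL = [477, 480, 428, 434, 423, 476]
AUTOENCODER_DIAGONAL = [480, 476, 466, 480, 465, 480]

TEST_SAMPLES = 2880  # test-set size stated in the text


def accuracy_percent(correct_counts, total):
    """Overall accuracy in percent from per-class correct counts."""
    return 100.0 * sum(correct_counts) / total


def per_class_recall(matrix):
    """Fraction of each true class (row) that was classified correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


baseline_diagonal = [POINT_CLOUD_CM[i][i] for i in range(len(CLASSES))]
baseline = accuracy_percent(baseline_diagonal, TEST_SAMPLES)
regression = accuracy_percent(REGRESSION_DIAGONAL, TEST_SAMPLES)
autoencoder = accuracy_percent(AUTOENCODER_DIAGONAL, TEST_SAMPLES)
recall = dict(zip(CLASSES, per_class_recall(POINT_CLOUD_CM)))

print(f"baseline {baseline:.3f}%, regression {regression:.3f}%, "
      f"autoencoder {autoencoder:.3f}%")
```

The three values come out at approximately 89.97%, 94.38%, and 98.85%, which agrees with the reported figures. The per-class recall for the baseline also shows that Motion [E] is the weakest class, at roughly 73%.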

Fig. 17

Comparison of test accuracies of the final models.

VII. Conclusion

This research proposes schemes for improving the classification of arm movements based on point clouds obtained from a 4D imaging radar. To improve classification accuracy, we employed deep learning models to convert point cloud data into skeleton data. The first approach used a 2D-CNN regression model to enhance human skeleton quality and a 3D-CNN for classification. The 2D-CNN regression model was trained using point clouds obtained from the radar and skeleton images captured by the camera to predict arm angles. Using the trained 2D-CNN regression model, we reconstructed the skeletons of the human subjects based on the obtained arm angles and then classified the reconstructed skeleton model using a 3D-CNN. The second approach combined an AE with a 3D-CNN. To reconstruct the skeleton image from the point clouds, the AE was trained using point cloud data from the radar as input and skeleton images from the camera as output. The skeleton images produced by the AE were then classified using a 3D-CNN. Compared to the 89.97% accuracy of the reference model that applied a 3D-CNN to the point cloud data, the first approach achieved an accuracy of 94.38%, while the second attained 98.85%. Overall, these results indicate that combining radar data with camera data through cross-learning is a promising approach to arm motion classification.

Notes

This work was supported by the Technology Innovation Program (RS-2024-00417302, 50%) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea), and an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. RS-2024-00393808, Efficient design of RF components and systems based on artificial intelligence, 50%).

References

1. Kim Y., Ling H.. Human activity classification based on micro-Doppler signatures using a support vector machine. IEEE Transactions on Geoscience and Remote Sensing 47(5):1328–1337. 2009;https://doi.org/10.1109/TGRS.2009.2012849.
2. Le H., Hoang V. P.. Dop-DenseNet: densely convolutional neural network-based gesture recognition using a micro-Doppler radar. Journal of Electromagnetic Engineering and Science 22(3):335–343. 2022;http://doi.org/10.26866/jees.2022.3.r.95.
3. Narayanan R. M., Zenaldin M.. Radar micro-Doppler signatures of various human activities. IET Radar, Sonar & Navigation 9(9):1205–1215. 2015;https://doi.org/10.1049/iet-rsn.2015.0173.
4. Nguyen N., Doan V. S., Pham M., Le V.. SRCNN: Stacked-residual convolutional neural network for improving human activity classification based on micro-Doppler signatures of FMCW radar. Journal of Electromagnetic Engineering and Science 24(4):358–369. 2024;http://doi.org/10.26866/jees.2024.4.r.235.
5. Kim Y., Alnujaim I., You S., Jeong B. J.. Human detection based on time-varying signature on range-Doppler diagram using deep neural networks. IEEE Geoscience and Remote Sensing Letters 18(3):426–430. 2021;https://doi.org/10.1109/LGRS.2020.2980320.
6. Kim W. Y., Seo D. H.. Radar-based human activity recognition combining range–time–Doppler maps and range-distributed-convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing 60:article no. 1002311. 2022;https://doi.org/10.1109/TGRS.2022.3162833.
7. Yu C., Xu Z., Yan K., Chien Y. R., Fang S. H., Wu H. C.. Noninvasive human activity recognition using millimeter-wave radar. IEEE Systems Journal 16(2):3036–3047. 2022;https://doi.org/10.1109/JSYST.2022.3140546.
8. Tang L., Jia Y., Qian Y., Yi S., Yuan P.. Human activity recognition based on mixed CNN with radar multi-spectrogram. IEEE Sensors Journal 21(22):25950–25962. 2021;https://doi.org/10.1109/JSEN.2021.3118836.
9. Sakagami F., Yamada H., Muramatsu S.. Accuracy improvement of human motion recognition with MW-FMCW radar using CNN. In : Proceedings of 2020 International Symposium on Antennas and Propagation (ISAP). Osaka, Japan; 2021; p. 173–174. https://doi.org/10.23919/ISAP47053.2021.9391487.
10. Ji S., Xu W., Yang M., Yu K.. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1):221–231. 2013;https://doi.org/10.1109/TPAMI.2012.59.
11. Klarenbeek G., Harmanny R. I. A., Cifola L.. Multi-target human gait classification using LSTM recurrent neural networks applied to micro-Doppler. In : Proceedings of 2017 European Radar Conference (EURAD). Nuremberg, Germany; 2017; p. 167–170. https://doi.org/10.23919/EURAD.2017.8249173.
12. Kim Y., Alnujaim I., Oh D.. Human activity classification based on point clouds measured by millimeter wave MIMO radar with deep recurrent neural networks. IEEE Sensors Journal 21(12):13522–13529. 2021;https://doi.org/10.1109/JSEN.2021.3068388.
13. Mathe E., Mitsou A., Spyrou E., Mylonas P.. Arm gesture recognition using a convolutional neural network. In : Proceedings of 2018 13th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP). Zaragoza, Spain; 2018; p. 37–42. https://doi.org/10.1109/SMAP.2018.8501886.
14. Xie H., Han P., Li C., Chen Y., Zeng S.. Lightweight midrange arm-gesture recognition system from mmWave radar point clouds. IEEE Sensors Journal 23(2):1261–1270. 2023;https://doi.org/10.1109/JSEN.2022.3216676.
15. Chen Y. S., Cheng K. H.. BiCLR: radar–camera-based cross-modal bi-contrastive learning for human motion recognition. IEEE Sensors Journal 24(3):4102–4119. 2024;https://doi.org/10.1109/JSEN.2023.3344789.
16. Cao Z., Simon T., Wei S. E., Sheikh Y.. Realtime multi-person 2D pose estimation using part affinity fields. In : Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA; 2017; p. 1302–1310. https://doi.org/10.1109/CVPR.2017.143.
17. Rombach R., Blattmann A., Lorenz D., Esser P., Ommer B.. High-resolution image synthesis with latent diffusion models. In : Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, LA, USA; 2022; p. 10674–10685. https://doi.org/10.1109/CVPR52688.2022.01042.
18. Hirzi N. M., Ma’sum M. A., Pratama M., Jatmiko W.. Large-scale 3D point cloud semantic segmentation with 3D U-net ASPP sparse CNN. In : Proceedings of 2022 7th International Workshop on Big Data and Information Security (IWBIS). Depok, Indonesia; 2022; p. 59–64. https://doi.org/10.1109/IWBIS56557.2022.9924988.
19. Yang Y., Zhang Y., Lang Y., Li B., Guo S., Tan Q.. GAN-based radar spectrogram augmentation via diversity injection strategy. IEEE Transactions on Instrumentation and Measurement 72:article no. 2502512. 2023;https://doi.org/10.1109/TIM.2022.3225060.
20. Zhang L., Rao A., Agrawala M.. Adding conditional control to text-to-image diffusion models. In : Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris, France; 2023; p. 3813–3824.

Biography

Wonhyo Kim, https://orcid.org/0009-0008-0486-497X is a graduate student pursuing his M.S. degree in the Department of Electronic Engineering, Sogang University, Seoul, Korea. His primary topic of research is target detection using machine-learning techniques. His other research interests include radar image processing for enhanced object detection and classification using deep learning methods.

Juho Cha, https://orcid.org/0009-0002-0788-3654 received his B.S. degree from Sogang University, Seoul, Republic of Korea, where he is currently pursuing his M.S. degree in the Department of Electronic Engineering. His primary topic of research is radar signal processing. His research interests also include radar image processing and deep learning applications for target detection.

Jiyeon Choi, https://orcid.org/0009-0002-8351-5474 received her B.S. degree in electronics and information engineering from Korea Aerospace University (KAU), Goyang-si, Gyeonggi-do, Republic of Korea, in 2023. She is currently pursuing her M.S. degree in the Department of Electronic Engineering, Sogang University, Seoul, South Korea. Her research interests include radar signal processing, the use of radar with artificial intelligence for target detection and tracking, and classification using deep learning applications.

Amin Hong, https://orcid.org/0009-0002-3256-5789 received his B.S. degree in electrical engineering from Sogang University, Seoul, South Korea, in 2024. He is currently pursuing his M.S. degree at the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. His primary topic of research is computer architecture and systems. His other areas of interest include machine learning applications for target detection.

Dami Ko, https://orcid.org/0009-0003-6377-8119 received her B.S. degree from the Department of Electronic Engineering at Sogang University, where she is currently pursuing her M.S. degree in the same field. Her research interests include deep learning and computer vision.

Youngwook Kim, https://orcid.org/0000-0002-4067-6254 (S’03-M’08-S’14) received his B.S. degree in electrical engineering from Seoul National University, Korea, in 2003, and his M.S. and Ph.D. degrees in electrical and computer engineering from the University of Texas at Austin, USA, in 2005 and 2008, respectively. He worked in the Department of Electrical and Computer Engineering at California State University, Fresno, from 2008 to 2021. Since 2022, he has been a professor at Sogang University, South Korea. His research interests include radar signal processing, antenna design, and RF electronics. His primary topic of research is radar target classification using machine-learning techniques. Currently, he is focusing on remote detection, monitoring, and analysis of human motions using EM waves along with deep learning algorithms. He is a recipient of the Provost Award for Research, Scholarship and Creative Accomplishment. He has also received the LCOE Outstanding Research Award, the Claude Laval Jr. Award, and Provost’s New Faculty Award from California State University. He is also a recipient of the A. D. Hutchison fellowship from the University of Texas at Austin, and the National IT Fellowship from the Ministry of Information and Communication, Korea. He has published over 100 technical papers.

Fig. 1

Human point-cloud construction using a MIMO FMCW radar.

Fig. 2

Experimental environment: (a) horizontal, and (b) vertical.

Fig. 3

The six arm gestures: (a) Action [A]: Standing with right arm bent up at the side, (b) Motion [B]: Bending and straightening the left arm sideways, (c) Motion [C]: Bending and straightening both arms sideways, (d) Action [D]: Standing still with right arm at the side, (e) Motion [E]: Bending and straightening the right arm sideways, and (f) Motion [F]: Both arms up and down at the sides.

Fig. 4

2D point clouds for one frame of Motion [C].

Fig. 5

(a) Participant, (b) point cloud, (c) camera image along with the skeleton, and (d) skeleton extracted from (c).

Fig. 6

Composition of the 3D data constructed by accumulating 2D point clouds across time.

Fig. 7

The 3D-CNN process using point clouds.

Fig. 8

Evaluation of test accuracy based on varying parameters for 3D-CNN.

Fig. 9

The four angles, along with labels, considered for the 2D-CNN regression model.

Fig. 10

Procedure for estimating the four angles using the 2D-CNN regression model.

Fig. 11

(a) Point clouds used as input to the regression model for testing Motion [F]; (b) Skeleton reconstruction obtained as output of the 3D-CNN model depicting the four angles.

Fig. 12

Classification process using 3D-CNN based on the reconstructed human model.

Fig. 13

(a) Point cloud image, and (b) augmented point cloud image in a frame.

Table 1

Structure of the 3D-CNN classification model for point clouds

No. | 3D-CNN structure | Parameter
1 | Input layer | 64×64×20×1
2 | 3D convolution layer + ReLU layer | Filter size 8×8×1, number of filters 4
3 | Max pooling layer | Filter size 2×2×1, stride of 2
4 | 3D convolution layer + ReLU layer | Filter size 8×8×1, number of filters 8
5 | Max pooling layer | Filter size 2×2×1, stride of 2
6 | Dropout layer | 30%
7 | Fully connected layer | 6
8 | Softmax layer + Classification layer | -

Table 2

Confusion matrix using point clouds with 1-fold validation

True \ Estimated | Action [A] | Motion [B] | Motion [C] | Action [D] | Motion [E] | Motion [F]
Action [A] | 463 | 0 | 15 | 0 | 0 | 2
Motion [B] | 0 | 480 | 0 | 0 | 0 | 0
Motion [C] | 0 | 0 | 480 | 0 | 0 | 0
Action [D] | 0 | 57 | 0 | 423 | 0 | 0
Motion [E] | 0 | 85 | 0 | 46 | 349 | 0
Motion [F] | 0 | 2 | 78 | 4 | 0 | 396

Table 3

Structure of the 2D-CNN regression model

No. | 2D-CNN structure | Parameters
1 | Input layer | 64×64×1
2 | 2D convolution layer + ReLU layer | Filter size 8×8, 4 filters
3 | Max pooling layer | Filter size 2×2, stride of 1
4 | 2D convolution layer + ReLU layer | Filter size 5×5, 8 filters
5 | Max pooling layer | Filter size 2×2, stride of 1
6 | 2D convolution layer + ReLU layer | Filter size 5×5, 16 filters
7 | Max pooling layer | Filter size 2×2, stride of 16
8 | 2D convolution layer + ReLU layer | Filter size 4×2, 16 filters
9 | Dropout layer | 30%
10 | Fully connected layer | 4
11 | Clipped ReLU layer | Max of 180
12 | Regression layer | -

Table 4

Confusion matrix using 2D-CNN-based skeleton with 1-fold validation

True \ Estimated | Action [A] | Motion [B] | Motion [C] | Action [D] | Motion [E] | Motion [F]
Action [A] | 477 | 0 | 0 | 0 | 0 | 3
Motion [B] | 0 | 480 | 0 | 0 | 0 | 4
Motion [C] | 0 | 1 | 428 | 0 | 51 | 0
Action [D] | 0 | 28 | 0 | 434 | 18 | 0
Motion [E] | 0 | 8 | 49 | 0 | 423 | 0
Motion [F] | 0 | 0 | 0 | 0 | 4 | 476

Table 5

Structure of the 2D-CNN-based AE model

No. | 2D-CNN structure | Parameters
1 | Input layer | 64×64×1
2 | 2D convolution layer + ReLU layer | Filter size 3×3, 8 filters
3 | Max pooling layer | Filter size 2×2, stride of 2
4 | 2D convolution layer + ReLU layer | Filter size 3×3, 16 filters
5 | Max pooling layer | Filter size 2×2, stride of 2
6 | 2D convolution layer + ReLU layer | Filter size 3×3, 32 filters
7 | Max pooling layer | Filter size 2×2, stride of 2
8 | 2D convolution layer + ReLU layer | Filter size 3×3, 64 filters
9 | Max pooling layer | Filter size 2×2, stride of 2
10 | Transposed 2D convolution layer + ReLU layer | Filter size 2×2, 64 filters
11 | Transposed 2D convolution layer + ReLU layer | Filter size 2×2, 32 filters
12 | Transposed 2D convolution layer + ReLU layer | Filter size 2×2, 16 filters
13 | Transposed 2D convolution layer + ReLU layer | Filter size 2×2, 8 filters
14 | 2D convolution layer + ReLU layer | Filter size 3×3, 1 filter
15 | Regression layer | -

Table 6

Structure of the 3D-CNN classification model

No. | 3D-CNN structure | Parameters
1 | Input layer | 64×64×20×1
2 | 3D convolution layer + ReLU layer | Filter size 8×8×1, 1 filter
3 | Max pooling layer | Filter size 2×2×6, stride of 1
4 | Fully connected layer | 6
5 | Softmax layer + Classification layer | -

Table 7

Confusion matrix using the autoencoder-based skeleton with 1-fold validation

True \ Estimated | Action [A] | Motion [B] | Motion [C] | Action [D] | Motion [E] | Motion [F]
Action [A] | 480 | 0 | 0 | 0 | 0 | 0
Motion [B] | 0 | 476 | 4 | 0 | 0 | 0
Motion [C] | 0 | 0 | 466 | 0 | 0 | 14
Action [D] | 0 | 28 | 0 | 480 | 0 | 0
Motion [E] | 0 | 0 | 1 | 14 | 465 | 0
Motion [F] | 0 | 0 | 0 | 0 | 0 | 480