Intelligent Activity Recognition based on Improved Convolutional Neural Network

  • Park, Jin-Ho (Dept. of Information and Communication Engineering, Graduate School, Tongmyong University) ;
  • Lee, Eung-Joo (Department of Information & Communications Engineering, Tongmyong University)
  • Received : 2022.05.06
  • Accepted : 2022.05.31
  • Published : 2022.06.30

Abstract

In order to further improve the accuracy and time efficiency of behavior recognition in intelligent monitoring scenarios, a human behavior recognition algorithm based on YOLO combined with LSTM and CNN is proposed. Exploiting the real-time nature of YOLO target detection, specific behaviors in the surveillance video are first detected in real time and the target size, location, and other information are obtained; deep features are then extracted while noise data from irrelevant areas of the image are removed; finally, LSTM modeling of the time series is used to make the final behavior discrimination for the action sequence in the surveillance video. Experiments on the MSR and KTH datasets show that the average recognition rate of each behavior reaches 98.42% and 96.6%, respectively, and the average recognition speed reaches 210 ms and 220 ms. The proposed method is therefore effective for intelligent behavior recognition.

Keywords

1. INTRODUCTION

Behavior recognition based on machine vision analyzes and recognizes the behavior of people in video images and responds to specific behaviors, such as punching or running, in a timely manner so that monitoring personnel can deal with them. As a form of intelligent monitoring, it is widely used in fields such as security monitoring, human-computer interaction, and virtual reality [1].

In recent years, many scholars have studied vision-based behavior recognition from different angles, and behavior recognition has become a hot direction in computer vision research. Ng et al. [2] modeled video with an LSTM that takes the concatenated output of an underlying CNN as the input for the next moment, achieving a recognition rate of 82.6% on the UCF101 database. Ullah et al. [3] proposed a network combining a CNN with a deep bidirectional LSTM (DB-LSTM) to process video data, using the DB-LSTM to learn the order information between frame features and handling lengthy video sequences by analyzing the features of specific time intervals. Donahue et al. [4] proposed the long-term recurrent convolutional network (LRCN), which combines CNN and LSTM to extract features from video data: features are obtained from single-frame image information through the CNN, and the CNN outputs are then passed through the LSTM in chronological order. In this way the video data is characterized in both spatial and temporal dimensions, and an average recognition rate of 82.92% is obtained on the UCF101 database.

However, the mainstream methods above still face two main challenges: 1) extraction of target features; 2) the speed and real-time performance of the overall behavior recognition process. At present, most mainstream methods use a CNN to extract deep features. The CNN itself is computationally expensive, and most regions in a video stream do not contain the target, so extracting features from the entire image inevitably wastes computation. Meanwhile, target detection approaches such as motion foreground extraction and the optical flow method are neither real-time nor stable: they are easily affected by external conditions such as illumination, camera angle, and distance, which increases the amount of calculation and reduces time efficiency [5].

Traditional action recognition methods are mainly based on hand-crafted features and focus on designing powerful feature descriptors, such as the Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and Motion History Image (MHI) [6]. Li et al. [7] represent a series of human poses by extracting a representative bag of 3D points (BOPs) from videos; an action graph is then constructed with the BOPs as nodes, and human behavior recognition is performed by computing the probability of each path on the graph. Deep learning can automatically extract the multi-layer feature representations hidden in the data and has strong representational ability. Given this advantage, Chéron et al. [8] used single-frame depth features and optical flow data to capture motion information and then designed a multi-resolution convolutional neural network for behavior classification. Fan Heng et al. [9] extracted the moving foreground with a Gaussian mixture model, built a sample library of target behaviors from the training set, defined the behavior categories as prior knowledge, and trained a deep network model for behavior recognition. Tu et al. [10] proposed the MSR-CNN algorithm, which improves object detection by extracting features from motion salient regions (MSRs) and achieves accurate behavior recognition with less training data.

To address the shortcomings of the existing methods above, a human behavior recognition algorithm based on YOLO combined with LSTM and CNN is proposed. First, the specific behavior in the surveillance video is detected in real time and the size, location, and other information of the target are obtained; feature extraction is then performed on the target region, removing the noise data of irrelevant areas in the image. In this way the computational complexity of feature extraction and the time complexity of behavior recognition are further reduced. Experimental results on the public behavior recognition datasets KTH and MSR show that the proposed method can effectively perform behavior recognition in intelligent monitoring scenarios.

2. YOLO-LSTM-CONVOLUTIONAL NEURAL NETWORK

The YOLO-LSTM-CNN algorithm is mainly composed of three parts: target detection, feature extraction and behavior discrimination. The overall structure and process are shown in Fig. 1.

Fig. 1. YOLO-LSTM-CNN algorithm flow chart.

First, a behavior dataset matching the user-defined behavior categories is selected for training. After training, the YOLO model performs fast, real-time target detection on each frame of the video stream and frames out the target area in the image; a conventional CNN model is then used to extract the target features, and finally the feature vectors of the continuous action sequence are fed to the LSTM for the final behavior discrimination.

2.1 YOLO target detection

Target detection extracts the moving foreground or targets of interest from a video or image. YOLO is a regression-based deep learning target detection technique [11]. It integrates target region prediction and target category prediction into a single neural network model. In the testing phase the entire image is fed to the model at once, so the prediction combines the global information of the image, and the model needs only a single network evaluation to make its predictions. Compared with traditional target detection algorithms such as the optical flow method and background subtraction, and with deep-learning detectors such as R-CNN and Fast R-CNN, it is therefore many times faster. It achieves rapid target detection and recognition at 45 frames per second with high accuracy, which makes it well suited to real application environments.

Overall, the YOLO algorithm uses a single CNN model to achieve end-to-end target detection. The entire system is shown in Fig. 2: the input image is first resized to 448×448 and fed to the CNN, and the network prediction is then post-processed to obtain the detected targets. Compared with the R-CNN family, it is a unified framework, it is faster, and its training process is also end-to-end.

Fig. 2. YOLO detection system.

Specifically, YOLO's CNN divides the input image into an S×S grid, and each cell is responsible for detecting the targets whose center points fall within it. As shown in Fig. 3, the center of the person falls within a cell in the lower left corner, so that cell is responsible for predicting the person. Each cell predicts B bounding boxes and a confidence score for each box. This confidence actually covers two aspects: the probability that the bounding box contains a target, and the accuracy of the bounding box. The former is denoted Pr(object); when the bounding box is background (i.e., contains no target), Pr(object)=0, and when it contains a target, Pr(object)=1. The accuracy of the bounding box is characterized by the IOU (intersection over union) of the predicted box and the ground truth, denoted \(I O U_{p r e d}^{\text {truth }}\). The confidence is therefore defined as Pr(object)×\(I O U_{p r e d}^{\text {truth }}\). Many people regard YOLO's confidence simply as the probability that the bounding box contains a target, but it is in fact the product of these two factors, so the accuracy of the predicted box is also reflected in it.

The size and position of a bounding box are characterized by four values (x, y, w, h), where (x, y) is the center coordinate of the bounding box and w and h are its width and height. Note that the predicted center coordinate (x, y) is an offset relative to the upper-left corner of its cell, expressed in units of the cell size; the cell coordinates are defined as shown in Fig. 3. The predicted w and h are ratios relative to the width and height of the whole image, so in theory all four elements lie in the range [0, 1]. The predicted value of each bounding box thus actually contains five elements (x, y, w, h, c), where the first four describe the size and position of the bounding box and the last is the confidence.
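To make the two components of this confidence concrete, the following minimal sketch (not part of the paper; boxes are given as corner coordinates purely for simplicity) computes the IOU between a predicted box and the ground truth and the resulting box confidence Pr(object)·IOU:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def box_confidence(pr_object, pred_box, truth_box):
    """Confidence = Pr(object) * IOU(pred, truth); Pr(object) is 1 when the
    bounding box contains a target and 0 when it is background."""
    return pr_object * iou(pred_box, truth_box)

# A predicted box that overlaps the ground truth fairly well gets a high confidence.
print(box_confidence(1.0, (0.20, 0.20, 0.60, 0.80), (0.25, 0.20, 0.65, 0.85)))
```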

Fig. 3. Meshing.

Fig. 4. YOLO model predicted value structure

There is also the classification problem. Each cell additionally predicts C class probability values, which represent the probability that the target predicted by that cell belongs to each category. These probabilities are in fact conditional probabilities given that a bounding box contains an object, namely \(\operatorname{Pr}\left(\text { class }_{i} \mid \text { object }\right)\). It is worth noting that no matter how many bounding boxes a cell predicts, it predicts only one set of class probability values; this is a drawback of the YOLO algorithm, and the later improved version YOLO9000 binds the class probability predictions to the bounding boxes. At the same time, we can compute the class-specific confidence score of each bounding box:

\(\begin{gathered} \operatorname{Pr}\left(\text { class }_{i} \mid \text { object }\right) \times \operatorname{Pr}(\text { object }) \times I O U_{\text {pred }}^{\text {truth }} \\ =\operatorname{Pr}\left(\text { class }_{i}\right) \times I O U_{\text {pred }}^{\text {truth }} \end{gathered}\)       (1)

The class-specific confidence of a bounding box represents both the probability that the object in the box belongs to each category and how well the box matches the object.
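A small numerical sketch of Eq. (1), with hypothetical values (B = 2 boxes and C = 3 classes chosen only for illustration), shows how the per-cell class probabilities are weighted by each box confidence:

```python
import numpy as np

# Hypothetical values for one grid cell: B = 2 predicted boxes, C = 3 classes.
box_conf = np.array([0.8, 0.1])            # Pr(object) * IOU for each box
class_probs = np.array([0.7, 0.2, 0.1])    # Pr(class_i | object), shared by the cell

# Eq. (1): class-specific confidence = Pr(class_i | object) * Pr(object) * IOU
class_scores = np.outer(box_conf, class_probs)   # shape (B, C)
print(class_scores)   # the low-confidence box yields low scores for every class
```

A low-confidence box therefore receives low class-specific scores regardless of the class distribution, which is exactly how poorly matched boxes are suppressed.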

2.2 LSTM architecture

LSTM stands for Long Short-Term Memory; as the name suggests, it is a neural network capable of memorizing both long-term and short-term information. LSTM was first proposed by Hochreiter and Schmidhuber in 1997. With the rise of deep learning since 2012, LSTM has been continuously developed into a relatively systematic and complete framework and has been widely used in many fields.

The motivation of LSTM is to solve the long-term dependency problem of RNNs. The output of a traditional RNN node is determined only by its weights, bias, and activation function (Fig. 5). An RNN is a chain structure in which every time slice uses the same parameters.

Fig. 5. RNN unit

LSTM can solve the long-term dependency problem of RNN because it introduces a gate mechanism to control how features are passed on or discarded. For example, in a language-modeling task an LSTM can carry a feature observed at time \(t_{2}\) forward to time \(t_{9}\), making it possible to decide correctly at \(t_{9}\) whether a singular or plural form should be used. An LSTM is composed of a series of LSTM units, and its chain structure is shown in Fig. 6.

Fig. 6. LSTM unit.

The core of the LSTM is the conveyor-belt-like path running along the top of the unit. This part is generally called the cell state, and it runs through the entire LSTM chain from beginning to end.

\(C_{t}=f_{t} \times C_{t-1}+i_{t} \times \widetilde{C_{t}}\)       (2)

where \(f_{t}\) is called the forget gate, indicating which features of \(C_{t-1}\) are retained for computing \(C_{t}\). \(f_{t}\) is a vector, each element of which lies in the range [0,1]. Sigmoid is usually used as the activation function, since its output is a value in the interval [0,1]; in practice, in a trained LSTM the gate values are overwhelmingly very close to 0 or 1, with values in between few and far between. The operator ⊗, element-wise multiplication, is the most important gate mechanism of the LSTM, representing the element-wise product of \(f_{t}\) and \(C_{t-1}\).

\(f_{t}=\sigma\left(W_{f} \cdot\left[h_{t-1}, x_{t}\right]+b_{f}\right)\)       (3)

\(\widetilde{C}_{t}\) represents the cell state update value, which is obtained from the input data \(x_{t}\) and the hidden state \(h_{t-1}\) through a neural network layer; its activation function is usually tanh. \(i_{t}\) is called the input gate. Like \(f_{t}\), it is a vector whose elements lie in the interval [0,1], and it is likewise computed from \(x_{t}\) and \(h_{t-1}\) through the sigmoid activation function:

\(i_{t}=\sigma\left(W_{i} \cdot\left[h_{t-1}, x_{t}\right]+b_{i}\right)\)       (4)

\(\tilde{C}_{t}=\tanh \left(W_{c} \cdot\left[h_{t-1}, x_{t}\right]+b_{C}\right)\)       (5)

\(i_{t}\) is used to control which features of \(\widetilde{C}_{t}\) are used to update \(C_{t}\), in the same way as \(f_{t}\).

\(C_{t}=f_{t} \times C_{t-1}+i_{t} \times \widetilde{C}_{t}\)       (6)

Finally, in order to calculate the predicted value \(\widetilde{y}_{t}\) and generate the complete input for the next time slice, we need to calculate the output \(h_{t}\) of the hidden node.

\(o_{t}=\sigma\left(W_{o} \cdot\left[h_{t-1}, x_{t}\right]+b_{o}\right)\)       (7)

\(h_{t}=o_{t} \times \tanh \left(C_{t}\right)\)       (8)

\(h_{t}\) is obtained from the output gate \(o_{t}\) and the cell state \(C_{t}\), where \(o_{t}\) is calculated in the same way as \(f_{t}\) and \(i_{t}\). By initializing the mean of \(b_{o}\) to 1, the LSTM can be approximated to the GRU [12].
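Putting Eqs. (2)–(8) together, one LSTM time step can be written compactly. The sketch below is a plain NumPy illustration of these equations; the weight shapes and the toy dimensions are assumptions made only for the sketch, not the configuration used in this paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM time step following Eqs. (3)-(8); each W_* has shape
    (hidden, hidden + input) and acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)            # forget gate, Eq. (3)
    i_t = sigmoid(W_i @ z + b_i)            # input gate, Eq. (4)
    c_tilde = np.tanh(W_c @ z + b_c)        # cell state update value, Eq. (5)
    c_t = f_t * c_prev + i_t * c_tilde      # new cell state, Eqs. (2)/(6)
    o_t = sigmoid(W_o @ z + b_o)            # output gate, Eq. (7)
    h_t = o_t * np.tanh(c_t)                # hidden output, Eq. (8)
    return h_t, c_t

# Toy usage with a 4-dimensional input and a 3-dimensional hidden state.
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
params = [rng.standard_normal((n_h, n_h + n_in)) if i % 2 == 0 else np.zeros(n_h)
          for i in range(8)]                # W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_h), np.zeros(n_h), *params)
```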

2.3 YOLO-LSTM-CNN algorithm

The YOLO-LSTM-CNN algorithm uses YOLO target detection as an auxiliary step added to the mainstream LSTM-CNN pipeline. It combines the speed and real-time performance of YOLO target detection with the advantages of LSTM for long-sequence processing: the regression-based YOLO algorithm quickly detects frames containing the specific behavior, a CNN then extracts deep features of the target, and the LSTM, which avoids the vanishing-gradient problem, models the time series so that continuous action frames can be accurately discriminated. In this way the accuracy and time efficiency of action recognition are effectively improved. The framework of the YOLO-LSTM-CNN algorithm is shown in Fig. 7.

Fig. 7. YOLO-LSTM-CNN frame diagram.

In Fig. 7, (x,y) represents the center coordinates of the bounding box of the detected target, (w,h) is the width and height of the bounding box, c is the confidence of the detection, and \(x_{t}\) represents the extracted deep feature vector.

2.3.1 Model building

Both the YOLO target detection model and the CNN feature extraction model require images of the different behaviors to be cut from the video as a training set, with the location and size of the target framed. YOLO's fully connected layers regress the feature representation into region predictions, encoded as a vector of size S×S×(B×5+C). This means the image is divided into S×S regions, each region predicts B bounding boxes, and each bounding box is represented by five position parameters, namely the center coordinates (x,y), the width and height (w,h), and the confidence c, in addition to the C class probabilities of each region.
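As an illustration, the prediction vector can be reshaped back into per-cell boxes and class probabilities. The layout below (boxes first, then class probabilities) is an assumption for the sketch, and the values S=7, B=2, C=20 are those used later in Section 3.2; the exact memory layout depends on the implementation:

```python
import numpy as np

S, B, C = 7, 2, 20                              # grid size, boxes per cell, classes
pred = np.random.rand(S * S * (B * 5 + C))      # stand-in for the regression output

pred = pred.reshape(S, S, B * 5 + C)
boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # (x, y, w, h, c) for each box
class_probs = pred[..., B * 5:]                 # Pr(class_i | object) for each cell

print(boxes[3, 4, 0, 4])                        # confidence c of box 0 in cell (3, 4)
```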

Once the exact size, location, and other information of the target has been detected, a conventional CNN model can be used for deep feature extraction. The CNN takes video frames as input and produces a feature map of the image. In this paper, the VGGNet-16 network is selected as the training model for general feature learning. The front-end layers of the CNN (closer to the input image) extract basic features such as texture and color, while the layers closer to the back end extract increasingly high-level, abstract, task-oriented features. Therefore, the convolution weights are first learned on the 1000-category ImageNet data so that the network acquires a generalized understanding of many categories of visual objects; the network is then fine-tuned on the behavior training set KTH by retraining the parameters of the last few layers, after which the whole network is trained with a smaller learning rate so that it also learns behavioral features well.
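This fine-tuning strategy can be sketched as follows. The paper trains in Caffe; the PyTorch fragment below only illustrates the same idea, and the learning rates and the KTH class count of 6 are assumptions made for the sketch:

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained VGG-16 as the starting point for general feature learning.
model = models.vgg16(pretrained=True)

# Freeze the front-end convolutional layers that capture texture and color features.
for param in model.features.parameters():
    param.requires_grad = False

# Replace the last classifier layer so it predicts the behavior classes (6 for KTH).
num_classes = 6
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, num_classes)

# Stage 1: retrain only the last (classifier) layers on the behavior data.
head_optimizer = torch.optim.SGD(model.classifier.parameters(), lr=1e-3, momentum=0.9)

# Stage 2: unfreeze everything and fine-tune the whole network with a smaller rate.
for param in model.parameters():
    param.requires_grad = True
full_optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
```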

Finally, the feature vector sequence is fed into the LSTM model. The forward pass is used to compute the derivative of the objective function with respect to each weight as the sequence enters the LSTM units; the network is trained with back-propagation through time (BPTT) and real-time recurrent learning (RTRL) gradient descent, and the product of the probabilities assigned to the action sequences of all training samples, given their corresponding feature-value sequences, is maximized.
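Maximizing the product of the sequence probabilities is equivalent to minimizing the summed negative log-likelihood, i.e. a cross-entropy loss on the LSTM output. A minimal sketch of such a sequence classifier follows; the hidden size, sequence length, batch size, and the use of the last time step's output are assumptions made only for illustration:

```python
import torch
import torch.nn as nn

class BehaviorLSTM(nn.Module):
    """LSTM over per-frame feature vectors followed by a linear classifier."""
    def __init__(self, feat_dim=4096, hidden_dim=256, num_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])            # class logits from the last time step

model = BehaviorLSTM()
criterion = nn.CrossEntropyLoss()             # negative log-probability of the true class
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

feats = torch.randn(4, 30, 4096)              # 4 clips, 30 frames of VGG features each
labels = torch.randint(0, 6, (4,))
optimizer.zero_grad()
loss = criterion(model(feats), labels)        # minimizing this maximizes the product of probabilities
loss.backward()
optimizer.step()
```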

2.3.2 Recognition process

The single action recognition process of the YOLO-LSTM-CNN algorithm is shown in Fig. 8:

Fig. 8. Recognition process of a single behavior.

1) When a frame containing the specific behavior is detected, YOLO is first used to extract the position and confidence information; its speed of 45 frames/s allows real-time detection on surveillance video. After training on a large number of datasets, the accuracy of YOLO's behavior detection can exceed 90%.

2) On the basis of target detection, the image content within the target range is obtained and retained and the noise interference of the remaining background is removed, so that complete and accurate target features can be extracted. The VGGNet-16 model is used to extract 4096-dimensional deep feature vectors, which are combined with the target size and position information (x, y, w, h, c) predicted by YOLO and fed to the recognition module.

3) An LSTM unit serves as the recognition module. Unlike a standard RNN, the LSTM architecture uses memory cells to store and output information, allowing it to better capture the temporal relationships among multiple target actions; it finally outputs the behavior category of the entire action sequence. A schematic sketch of this recognition flow is given after this list.
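The sketch below summarizes steps 1)–3) as a single function; the three callables stand in for the trained YOLO, VGG-16, and LSTM models, and their interfaces are hypothetical:

```python
import numpy as np

def recognize_behavior(video_frames, yolo_detect, vgg_features, lstm_predict):
    """Schematic of steps 1)-3); the three callables stand in for the trained models."""
    feature_sequence = []
    for frame in video_frames:
        boxes = yolo_detect(frame)                     # step 1: [(x, y, w, h, c), ...]
        if not boxes:
            continue                                   # no behavior frame detected
        x, y, w, h, c = boxes[0]
        # step 2: keep only the target region, discarding background noise
        x0, y0 = max(0, int(x - w / 2)), max(0, int(y - h / 2))
        patch = frame[y0:y0 + int(h), x0:x0 + int(w)]
        feat = vgg_features(patch)                     # 4096-dim deep feature vector
        feature_sequence.append(np.concatenate([feat, [x, y, w, h, c]]))
    # step 3: the LSTM discriminates the behavior category of the whole action sequence
    return lstm_predict(np.stack(feature_sequence))
```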

3. EXPERIMENTAL RESULTS AND ANALYSIS

3.1 Data set

1) The MSR dataset contains 16 video sequences with 63 actions in total: 14 clapping, 24 waving, and 25 boxing, performed by 10 experimenters. Each sequence contains multiple types of actions, some sequences contain actions performed by different people, and the scenes are divided into indoor and outdoor. All video sequences were captured against cluttered, moving backgrounds; each video is 320 × 240 pixels at 15 frames/s, with lengths between 32 s and 76 s.

2) The KTH dataset includes 6 human actions: Walking, Jogging, Running, Boxing, Hand waving, and Hand clapping, which are performed by 25 different people in 4 scenarios, with a total of 2391 video sequences. All sequences were downsampled to a resolution of 160 × 120 pixels, with an average length of 4 s, and divided into 1 training set (8 individuals), 1 validation set (8 individuals) and 1 testing set (9 individuals).

3.2 Model training

The YOLO-LSTM-CNN algorithm is trained in the popular Caffe framework. The target detection model follows the YOLO architecture with S=7, B=2, C=20. Feature extraction uses the VGGNet model pre-trained on ImageNet, fine-tuned on the behavior dataset; during training, the batch size is set to 256 and the momentum to 0.9. After fine-tuning, the features of different layers are cached for further use.

The next step is to collect the extracted convolutional features. The fully connected layer outputs a fixed 4096-dimensional vector, and this feature vector is fed to the LSTM unit. A Softmax classifier is then trained to classify the actions; the Softmax layer is concatenated after the LSTM, and its output size is the number of action categories. The RNN is trained with back-propagation through time (BPTT) and real-time recurrent learning (RTRL) gradient descent, with the batch size set to 64, momentum set to 0.9, and base learning rate set to 0.01; the learning rate is multiplied by 0.1 every 20,000 iterations, and the model starts to converge after 50,000 training iterations.
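For reference, the step-decay schedule described above can be written as a small helper (a sketch of the schedule only, not of the solver configuration itself):

```python
def learning_rate(iteration, base_lr=0.01, gamma=0.1, step_size=20000):
    """Step decay: the learning rate is multiplied by 0.1 every 20,000 iterations."""
    return base_lr * gamma ** (iteration // step_size)

# 0.01, 0.001, 0.0001 (up to floating-point rounding)
print(learning_rate(0), learning_rate(20000), learning_rate(40000))
```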

3.3 Experimental results

Fig. 9 shows the accuracy of YOLO detection for various behaviors in the MSR and KTH datasets.

Fig. 9. Detection accuracy of each behavior in MSR(a) and KTH(b) datasets.

For individual behaviors with large motion amplitude and obvious movement, such as punching and waving, the false detection rate is within 10%, which is close to the accuracy of the final behavior recognition. This shows that the recognition rate of YOLO behavior detection remains at a high level, and the subsequent LSTM modeling of the time series can further eliminate false detections.

Fig. 10 shows the confusion matrices of the proposed method for behavior recognition on the two datasets, where rows represent the correct categories and columns represent the classification results of the algorithm. As can be seen from Fig. 10, the degree of confusion between behaviors in both datasets is low, and the two behaviors of boxing and waving are essentially never confused with other behaviors. However, because some actions are similar and there is a certain amount of interference in different scenes, there is slight confusion between waving and clapping, and among walking, jogging, and running. The proposed method performs precise localization based on the human-body region detected by YOLO, which pre-delimits the action and eliminates redundant background noise.

Fig. 10. Confusion matrix of each behavior in MSR(a) and KTH(b) datasets.

Tables 1 and 2 compare the behavior recognition rates of the algorithms in different papers with the proposed algorithm on the MSR dataset and the KTH dataset, respectively. Table 1 shows that for the three behaviors in the MSR dataset, the recognition rate of the proposed algorithm is higher than that of the other four algorithms for punching and clapping, with improvements of 0.75% and 1.17%, respectively. Table 2 shows that for the six behaviors in the KTH dataset, punching has the highest recognition rate, while slow walking has the lowest because of its small motion amplitude and its easy confusion with running, walking, and other actions. Compared with the other methods, the recognition rates of punching, running, and slow walking improve by 0.42%, 1.96%, and 1.25%, respectively, and the remaining three behaviors are recognized essentially as well as by the methods in the literature.

Table 1. Comparisons with other methods in the MSR dataset.

Table 2. Comparisons with other methods in the KTH dataset.

The average recognition speed of the YOLO-LSTM-CNN algorithm is 210 ms and 220 ms on the two datasets, respectively, which is significantly faster than the other algorithms. The real-time target and behavior detection not only improves the overall recognition speed of the model, but also makes the training of the network parameters more effective, so that a deep learning model with better performance and higher recognition accuracy is finally obtained. This proves the effectiveness of the proposed algorithm.

4. CONCLUSION

A human action recognition algorithm based on YOLO combined with LSTM and CNN is proposed. It uses the rapidity and real-time nature of YOLO target detection to detect specific actions in intelligent video surveillance in real time, removes the noise data of irrelevant areas in the image, and, combined with LSTM modeling of long-term sequences, can quickly detect and recognize behaviors in video surveillance while reducing the computational and time complexity of behavior recognition. Experiments on the public behavior recognition datasets MSR and KTH show average per-behavior recognition rates of 98.42% and 96.6%, respectively, which demonstrates the effectiveness of the proposed method; it can be applied to most security fields, such as intelligent monitoring, that require high real-time performance in complex scenes.

Because the network structure of the YOLO-LSTM-CNN model is complex, multiple models (detection, feature extraction, and LSTM) must be trained, which is time-consuming and difficult. Optimizing the model training strategy and adapting the method to more application scenarios will be the direction of future work.

REFERENCES

  1. G. Stavropoulos, D. Giakoumis, K. Moustakas, and D. Tzovaras, "Automatic Action Recognition for Assistive Robots to Support MCI Patients at Home," Proceedings of the 10th International Conference on Pervasive Technologies Related to Assistive Environments, pp. 366-371, 2017.
  2. J.Y.H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond Short Snippets: Deep Networks for Video Classification," Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694-4702, 2015.
  3. A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad, and S.W. Baik, "Action Recognition in Video Sequences Using Deep Bi-directional LSTM with CNN Features," IEEE Access, Vol. 6, pp. 1155-1166, 2017. https://doi.org/10.1109/access.2017.2778011
  4. J. Donahue, L.A. Hendricks, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-Term Recurrent Convolutional Networks for Visual Recognition and Description," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 39, No. 4, pp. 677-691, 2016. https://doi.org/10.1109/TPAMI.2016.2599174
  5. G. Yu and T. Li, "Recognition of Human Continuous Action with 3D CNN," Proceedings of the 11th International Conference on Computer Vision Systems, pp. 314-322, 2017.
  6. A.B. Mahjoub and M. Atri, "Human Action Recognition Using RGB Data," The 11th International Design & Test Symposium, Hammamet, pp. 83-87, 2017.
  7. L. Wanqing, Z. Zhengyou, and L. Zicheng, "Action Recognition Based on a Bag of 3D Points," Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 9-14, 2010.
  8. G. Cheron, I. Laptev, and C. Schmid, "P-CNN: Pose-Based CNN Features for Action Recognition," Proceedings of 2015 IEEE International Conference on Computer Vision, pp. 3218-3226, 2015.
  9. F. Heng, "Human Behavior Recognition Based on Deep Learning," Journal of Wuhan University Information Science Edition, Vol. 41, No. 4, pp. 492-497, 2016.
  10. T. Zhigang, C. Jun, L. Yikang, and L. Baoxin, "MSR-CNN: Applying Motion Salient Region Based Descriptors for Action Recognition," Proceedings of the 23rd International Conference on Pattern Recognition, pp. 3524-3529, 2016.
  11. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," Proceedings of 2016 IEEE Computer Vision and Pattern Recognition, pp. 779-788, 2016.
  12. K. Greff, R.K. Srivastava, J. Koutnik, B.R. Steunebrink, and J. Schmidhuber, "LSTM: A Search Space Odyssey," IEEE Transactions on Neural Networks and Learning Systems, Vol. 28, No. 10, pp. 2222-2232, 2016. https://doi.org/10.1109/TNNLS.2016.2582924
  13. E.P. Ijjina and K.M. Chalavadi, "Human Action Recognition in RGB-D Videos Using Motion Sequence Information and Deep Learning," Pattern Recognition, Vol. 72, pp. 504-516, 2017. https://doi.org/10.1016/j.patcog.2017.07.013
  14. L. Jun, W. Gang, D. Ling-Yu, K. Abdiyeva, and A.C. Kot, "Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks," IEEE Transactions on Image Processing, Vol. 27, No. 4, pp. 1586-1599, 2017. https://doi.org/10.1109/TIP.2017.2785279
  15. S. Megrhi, M. Jmal, W. Souidene, and A. Beghdadi, "Spatio-Temporal Action Localization and Detection for Human Action Recognition in Big Dataset," Journal of Visual Communication and Image Representation, Vol. 41, pp. 375-390, 2016. https://doi.org/10.1016/j.jvcir.2016.10.016
  16. A.B. Sargano, W. Xiaofeng, P. Angelov, and Z. Habib, "Human Action Recognition Using Transfer Learning with Deep Representations," Proceedings of 2017 International Joint Conference on Neural Networks, Anchorage, pp. 463-469, 2017.
  17. Z. Ning, J.-H. Park, and E.-J. Lee, "Multi-Human Behavior Recognition Based on Improved Posture Estimation Model," Journal of Korea Multimedia Society, Vol. 24, No. 5, pp. 659-666, 2021. https://doi.org/10.9717/KMMS.2021.24.5.659