Development of a Method to Detect Humans and Their Motion Based On a Self-Supervised Learning Method: A Comprehensive Review
Keywords:
Self-Supervised Learning, Human Motion Detection, Optical Flow, Human Pose Estimation, Action Recognition, Gait Analysis, Computer Vision, Deep Learning, World-Local Flows, Joint Constraint Learning

Abstract
This review examines recent advances in human detection and motion analysis using self-supervised learning methods. Traditional approaches to human motion understanding have relied largely on supervised learning techniques that require extensive labeled datasets, which are costly and time-consuming to acquire. The emergence of self-supervised learning paradigms has transformed the field by enabling models to learn meaningful representations from unlabeled data. This review systematically traces the evolution from traditional optical flow and pose-based approaches to modern self-supervised frameworks, with particular focus on the H-MoRe (Human Motion Representation) method. We examine technical foundations, architectural innovations, and performance benchmarks across multiple applications, including gait recognition, action recognition, and video generation. The analysis covers performance metrics, comparative evaluations, and practical implementations on diverse datasets, including CASIA-B, Diving48, and UTD-MHAD. The paper also discusses current challenges, future research directions, and the potential impact of these methods on real-world applications in healthcare, surveillance, sports analytics, and human-computer interaction.
License
Copyright (c) 2025 International Journal of Scientific Research in Computer Science, Engineering and Information Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.