Abstract—This work proposes a supervised framework based on deep convolutional neural networks (CNNs) for vision-based recognition of static sign language gestures. Our approach addresses the acquisition and segmentation of correct inputs for the CNN-based classifier: cropped images of hands in the different poses that depict Indian Sign Language gestures. A Microsoft Kinect sensor allows robust and fast hand tracking despite complex backgrounds and varying lighting conditions, overcoming major real-world issues such as background clutter, poor lighting and occlusions. The system was trained and tested on static ISL (Indian Sign Language) gestures for alphabet letters and words.
I. INTRODUCTION

Sign language is regarded as the most grammatically structured category of gestural communication, which makes it an ideal test bed for developing methods to solve problems such as motion analysis and human-computer interaction. Human-Computer Interaction (HCI) is present everywhere in our daily lives and is mostly achieved using a touch screen, mouse, keyboard, etc. These devices hinder a Natural User Interface (NUI), as they place a large barrier between human and computer [9]. Researchers have acquired sign language input using webcams, Kinect sensors, instrumented gloves, the Myo armband, etc.

A. Need

Deaf and mute people comprise more than 5% of the world population. It has been observed that they often find it difficult to interact with hearing people through gestures, since only very few gestures are recognized by most people. Because people who are deaf or hearing-impaired cannot use the verbal form of communication, they have to depend on some form of visual gesture. Sign language is the primary means of communication in the deaf and mute community. Like any other language, it has its own grammar and vocabulary, but it uses the visual modality for exchanging information. The problem arises when deaf or mute people try to express themselves to other
people, who are usually unaware of this grammar and vocabulary. As a result, it has often been observed that the communication of a deaf or mute person is limited to his or her family or to the deaf and mute community. The software described here aims to aid deaf and mute people by translating static sign language hand gestures into text. The project has a broad scope, as it bridges the gap between the hearing-impaired and the rest of society. Moreover, if a computer could understand and translate hand gestures, it would not only serve the deaf and mute community but also represent a leap forward in human-computer interaction.

B. Existing Solutions

Gesture-enabled HCI (Human-Computer Interaction) transcends barriers and limitations by bringing the user one step closer to true one-to-one interactivity with the computer, and there has been much active research in recent years on novel devices and techniques that enable it. There are generally two approaches to interpreting gestures by computers. The first attempts to solve the problem with hardware: this approach requires the user to wear bulky devices, hindering the ease and naturalness of interacting with the computer. Although the hardware-based approach provides high accuracy, it is not practical in users' everyday lives. This has led to active research on a more natural HCI technique, computer vision, which uses cameras and computer-vision algorithms to interpret gestures [3]. Research on vision-based HCI has enabled many new possibilities and interesting applications; some of the most popular examples are tabletop interfaces, the visual touchpad [3], TV remote control, augmented reality and mobile augmented reality. Vision-based HCI can be further categorized into marker-based and marker-less approaches. Several studies utilize colour markers or gloves for real-time hand tracking and gesture recognition. This approach is easier to implement and has better accuracy, but it is less natural and not intuitive. Other studies focused on the marker-less approach, using techniques such as
Haar-like features [8], convexity defects, K-curvature, bag-of-features, template matching, the circular Hough transform, particle filtering, and hidden Markov models. Most studies on the marker-less approach focused on recognizing either static hand poses or dynamic gestures, but not both, so the variety of inputs is very limited. Several researchers use Haar-like features [8], which require high computing power, and the classifier preparation stage also consumes a lot of time. Some studies use K-curvature to find peaks and valleys along a contour and then classify these as fingertips [5]; however, this is also CPU intensive because every point along the contour perimeter must be evaluated. Moreover, they did not address differentiating between human face and hand regions, because they assumed that only hand regions are visible to the camera. The authors of [6] utilize Haar-like features [8] to remove the face region first, but their method suffers from background colour leakage due to its simple background subtraction. Most studies focused only on efficient hand-recognition algorithms and did not translate the detected hand into functional inputs. Some authors utilize static hand-gesture recognition to simulate gaming inputs by translating different static gestures into keyboard events; however, the limited set of static gestures makes game control difficult and tedious. While most researchers develop a sample application as a proof of concept, the hand-tracking [4] capability is limited to their own application and cannot interface with other applications beyond passing simple mouse events [9].
C. Challenges
Linguistic studies on ISL started around 1978, and it has been found that ISL is a complete natural language, originated in India, with its own morphology, phonology, syntax, and grammar [7]. Research on ISL linguistics and phonology is hindered by the lack of linguistically annotated and well-documented ISL data. A dictionary of around 1000 signs in four different regional varieties has been released [7]; however, these signs are presented as graphical icons, which are not only difficult to understand but also lack phonological features such as movements and non-manual expressions. As noted above, ISL is used not only by deaf people but also by the hearing parents of deaf children, the hearing children of deaf adults, and hearing educators of the deaf [7]. Therefore, there is a significant need for a system that can associate signs with the words of a spoken language and that can further be used to learn ISL. Most current systems suffer from the following limitations:
• Most systems are specific to a native (spoken) language and hence cannot be used for ISL.
• Most systems provide a word-to-sign search, but very few provide a sign-to-word or sign-to-sign search.
• Systems lack sophisticated phonological information such as hand shape, orientation, movements, and non-manual signs.
• A requirement for long sleeves or gloves.
• Segmenting hands from the image is difficult when other objects in the scene have a colour similar to that of skin.
• Hand and face regions may occlude each other from the camera's point of view [9].
• Random noise appears in captured images due to poor camera quality and lighting conditions.
• Efficient tracking of moving hands requires advanced algorithms that often have high computational complexity.
II. ARCHITECTURE OF THE PROPOSED SYSTEM

The proposed system architecture is shown in Fig. 1. The overall system consists of six modules, summarized below:
A. Live Video Feed: This module is responsible for connecting to the Kinect v2 sensor for Microsoft Windows, capturing its image output (RGB, depth and infrared images), and then processing this output with different image-processing techniques.
Fig. 1: High-level architecture of the proposed system.
B. Tracking through Kinect: The Microsoft PyKinect2 module and the Pygame API were used to interact with the Kinect SDK for skeletal tracking. These modules provide the positional coordinates of different joints of the human body, including the hand tips, wrists, elbows, ankles, shoulders, head and knees.
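For illustration, the following is a minimal sketch of how such joint coordinates could be read through the pykinect2 wrapper around the Kinect SDK v2; the class, method and joint names used here (PyKinectRuntime, body_joints_to_color_space, JointType_HandTipRight) follow that package's conventions and are assumptions rather than details given in this paper.

from pykinect2 import PyKinectV2, PyKinectRuntime

# Open the body (skeleton) and colour streams on the Kinect v2 sensor.
kinect = PyKinectRuntime.PyKinectRuntime(
    PyKinectV2.FrameSourceTypes_Color | PyKinectV2.FrameSourceTypes_Body)

while True:  # simple polling loop for the sketch
    if not kinect.has_new_body_frame():
        continue
    bodies = kinect.get_last_body_frame()
    if bodies is None:
        continue
    for i in range(kinect.max_body_count):
        body = bodies.bodies[i]
        if not body.is_tracked:
            continue
        # Project camera-space joints onto colour-image pixel coordinates.
        points = kinect.body_joints_to_color_space(body.joints)
        hand_tip = points[PyKinectV2.JointType_HandTipRight]
        wrist = points[PyKinectV2.JointType_WristRight]
        print('hand tip (%.0f, %.0f), wrist (%.0f, %.0f)'
              % (hand_tip.x, hand_tip.y, wrist.x, wrist.y))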
Fig. 2: Skeleton structure obtained by joining the positional coordinates of related joints.
C. Hand Segmentation: Studies [1], [2], [6] show that YCrCb colour ranges are best suited for representing the skin-colour region and provide good coverage across different human complexions. The range below is used as a threshold to perform skin-colour extraction in YCrCb colour space; while it efficiently covers a wide range of skin colours, it also causes any object with a skin-like colour, such as orange, pink or brown objects, to be falsely extracted. Morphological and smoothing filters were therefore applied to reduce the noise in the image. The YCrCb range (based on some modifications to the range cited in [9]) used to extract skin-coloured objects from the image is:

16 < Y < 240, 133 < Cr < 173, 77 < Cb < 127
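As an illustration only (not code from the paper), the skin-colour extraction and noise filtering described above could be implemented with OpenCV roughly as follows; the kernel size and blur parameters are assumed values.

import cv2
import numpy as np

def skin_mask(bgr_frame):
    # Convert to YCrCb and threshold with the ranges given above
    # (OpenCV orders the channels as Y, Cr, Cb).
    ycrcb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2YCrCb)
    lower = np.array([16, 133, 77], dtype=np.uint8)
    upper = np.array([240, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    # Morphological opening/closing and smoothing to suppress noise.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    mask = cv2.GaussianBlur(mask, (5, 5), 0)
    return mask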
Fig. 3: Extracting cropped images of hands through skin-colour-based segmentation.
The coordinates of the fingertip and wrist joints provided by the skeletal tracking module, together with the skin segmentation, were used to effectively crop the image segments of the hands, which were then used as input to the classifier.
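A minimal sketch of this cropping step follows, assuming a BGR colour frame, the skin mask from the previous sketch, and a hand-tip pixel coordinate from the tracking module; the fixed window size and the function name are assumptions made for illustration.

import cv2

def crop_hand(color_frame, mask, hand_tip_xy, half_size=64):
    # Clamp a square window around the hand-tip pixel to the frame bounds.
    h, w = color_frame.shape[:2]
    x, y = int(hand_tip_xy[0]), int(hand_tip_xy[1])
    x0, x1 = max(x - half_size, 0), min(x + half_size, w)
    y0, y1 = max(y - half_size, 0), min(y + half_size, h)
    crop = color_frame[y0:y1, x0:x1]
    # Keep only skin-coloured pixels inside the window.
    return cv2.bitwise_and(crop, crop, mask=mask[y0:y1, x0:x1])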
D. CNN Classifier: The classification system is based on convolutional neural networks. The classifier takes a segmented image of a hand and assigns it to the corresponding category. It was trained on a dataset of image samples and corresponding labels for the different ISL signs. A detailed discussion of the CNN architecture and training is given later in this report.

E. Word Generator: The class labels predicted by the classifier form a continuous character stream. This stream is cleaned to filter out noise and identify the intended characters. An English-language dictionary is used to build an Aho-Corasick trie, and the cleaned character stream is scanned against it to find meaningful words. An illustrative sketch of this step is given after the references.

F. Text-to-Speech Synthesizer: The Pyttsx Python text-to-speech engine is used to speak the words generated by the word-generator module, providing audio feedback to the user.

III. CONCLUSION

The conclusion goes here.

ACKNOWLEDGMENT

The authors would like to thank…

REFERENCES

[1] D. Chai and K. N. Ngan. Face segmentation using skin-color map in videophone applications. IEEE Transactions on Circuits and Systems for Video Technology, 9(4):551–564, 1999.
[2] T. Mahmoud. A new fast skin color detection technique. World Academy of Science, Engineering and Technology, pages 501–505, 2008.
[3] S. Malik and J. Laszlo. Visual touchpad: A two-handed gestural input device. In Proceedings of the 6th International Conference on Multimodal Interfaces (ICMI '04), pages 289–296, 2004.
[4] C. Manresa-Yee, J. Varona, R. Mas, and F. Perales. Hand tracking and gesture recognition for human-computer interaction. Electronic Letters on Computer Vision and Image Analysis, ISSN 1577-5097, 2000.
[5] K. Oka, Y. Sato, and H. Koike. Real-time fingertip tracking and gesture recognition. IEEE Computer Graphics and Applications, 22(6):64–71, 2002.
[6] S. K. Singh, D. S. Chauhan, M. Vatsa, and R. Singh. A robust skin color based face detection algorithm. Tamkang Journal of Science and Engineering, 6:227–234, 2003.
[7] M. Vasishta, J. Woodward, and S. De Santis. An Introduction to Indian Sign Language. All India Federation of the Deaf, 1998.
[8] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[9] H.-S. Yeo, B.-G. Lee, and H. Lim. Hand tracking and gesture recognition system for human-computer interaction using low-cost hardware. Multimedia Tools and Applications, 74(8):2687–2715, 2015.
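The following is a minimal, illustrative sketch of the word-generation and speech steps described in Sections II-E and II-F, assuming the pyahocorasick package for the trie and pyttsx3 (the maintained fork of the Pyttsx engine) for speech output; the toy dictionary and the example character stream are placeholders, not data from this work.

import ahocorasick
import pyttsx3

def build_automaton(words):
    # Build an Aho-Corasick trie/automaton over the dictionary words.
    automaton = ahocorasick.Automaton()
    for word in words:
        automaton.add_word(word, word)
    automaton.make_automaton()
    return automaton

def speak_matches(char_stream, automaton):
    # Scan the cleaned character stream and speak every dictionary hit.
    engine = pyttsx3.init()
    for _end_index, word in automaton.iter(char_stream):
        engine.say(word)
    engine.runAndWait()

# Toy example: a three-word dictionary and a noisy predicted stream.
automaton = build_automaton(['hello', 'help', 'world'])
speak_matches('xhelloqworldz', automaton)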