Essay: HUMAN action recognition

Essay details and download:

  • Subject area(s): Computer science essays
  • Reading time: 22 minutes
  • Price: Free download
  • Published: 7 July 2019*
  • File format: Text
  • Words: 6,334 (approx)
  • Number of pages: 26 (approx)

Human action recognition has become one of the most
important topics in the field of pattern recognition,
especially due to its continually growing use in modern
everyday applications. Automated crowd surveillance, smart
houses and assistive environments, gaming, automated sport
analysis and human-machine interaction are examples
of such applications.
The problem of human action recognition is the automatic
detection and analysis of human activities from information
acquired from cameras or other sensing modalities. Although
the idea is simple, the specific task is notably challenging
as any relevant system has to overcome a large number of
restrictive parameters. Illumination variations, camera view
angle, complicated backgrounds and occlusions are only a fraction
of the existing set of problems. In addition,
individuality is another very important factor
that cannot be neglected, as every person performs the same
set of movements (action) in a unique way.
A. Human action recognition related work
Over the last decade, a large number of relevant algorithms
have been proposed, while 3D information has started to
play a leading role in newer technologies. Although the first
approaches to human action recognition based on 3D data
appeared in the early 1980s, research was mostly focused
on data received by visible-light cameras [1]. As working
with visible light captured by monocular sensors results in
considerable loss of information, the recent release of low-cost
depth sensors further boosted the growth of research on 3D
data. A recent review in [1] summarizes the major techniques
in human activity recognition and separates them into three
main categories: 3D from stereo, 3D from motion capture
and 3D from range sensors. The paper focuses especially on
techniques that use depth data. Another recent survey
in [2] summarizes exclusively the techniques that are based on
depth imagery.
The nature of the features used to represent activities can be
considered a determinant classification factor for different human
action recognition techniques. Authors in [7] distinguish
two main categories: methods based on dynamic features,
which are dominant in the relevant literature and arguably
more successful, and techniques based on static, pose based
features which focus on extracting features from still images,
rather than image sequences.
Techniques based on still images usually employ human
silhouette extraction. Their advantage comes not from their accuracy,
which is generally inferior to that of sequence-based
methods, but from their ability to draw inferences from single
frames. Authors of [3] and [4] present typical examples of
this methodology. The first study employs a bag-of-rectangles
based technique and the latter performs behavior classification
by extracting eigenshapes from single-frame silhouettes, with
the use of Principal Component Analysis (PCA).
With the increasing use of depth sensors, such as the
Microsoft Kinect, authors in [5] utilize infrared imaging to
enhance the accuracy of the detected pose. Classification is
handled using HOG-based descriptors, a method also preferred
in [6] where categorization is performed on a set of
hockey player actions. HOG-based descriptors are also used
in [7], where an approach that represents action classes with
histograms of pose primitives is formulated, to better handle
articulated poses and cluttered background. Finally, authors
in [8] presented a technique that focuses on extracting key
poses from action sequences. In essence, it selects the most
discriminative poses from a set of candidates, in an attempt to
avoid using complex action representations.
A number of motion sequence focused techniques, based
on the Local Binary Patterns (LBP) methodology, have been
proposed. In particular, authors in [9] developed a method
that is resilient to texture variations caused by motion. In
[10], the authors work in the space-time domain, which is
partitioned along the three axes (x, y, t), in order to construct
LBP histograms of the x-t and y-t planes. Similarly, a
technique presented in [11] relies on a variant of LBP in order
to capture local features of optical flow and represent actions
as strings of atoms. An approach that uses depth information is
presented in [12], where motion cues are captured from depth
motion maps and LBPs are utilized to create more compact
representations.
Further research on the spatio-temporal feature extraction
for actions has given works such as the one presented in
[13], which uses hierarchically ordered spatio-temporal feature
detectors, inspired by biology. Space-time interest points are
also used to represent and learn human action classes in [14].
Works presented in [15], [16] and, more recently, [17] delved
further into the concept of exploiting spatio-temporal features
and combined optical flow based information with optical
features, showing better results. In another study in [18], a
spatio-temporal feature point detector is proposed, based on a
computational model of salience.
Many studies focused on different representation methods,
which characterized the produced techniques, as stated in
[19]. A common trend among researchers was to study the
evolution of the human silhouette through time. For instance,
authors in [20] introduced the use of temporal templates, called
Motion History (MH) and Motion Energy (ME), for action
representation. In [21], an extension of the previous study was
presented, inspired by MH templates. It introduced the Motion
History Volumes as a viewpoint independent representation.
Similarly, authors in [22] represented action sequences as
generalized cylindrical volumes, while in [23], spatio-temporal
volumes were generated based on a sequence of 2D contours
that basically are the 2D projection of the outer boundary
points of an object performing an action in 3D, with respect
to time. The notion of space-time volumes is also used in [24]
and [25], which worked on silhouettes extracted over time.
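The temporal templates mentioned above can be illustrated with a minimal Python sketch. It assumes the commonly cited update rule for a motion history image (a moving pixel is set to a duration value tau, any other pixel decays by one per frame) and derives the motion energy image by thresholding the history above zero. The function names, the list-of-rows image format and the binary motion mask input are assumptions made for this example, not details taken from [20].

```python
def update_motion_history(mhi, motion_mask, tau):
    """Return the motion history image after observing one more frame.

    mhi         -- current history image (list of rows of non-negative ints)
    motion_mask -- binary image, truthy where motion was detected
    tau         -- number of frames a motion trace is remembered
    """
    return [
        [tau if moving else max(0, value - 1)
         for value, moving in zip(mhi_row, mask_row)]
        for mhi_row, mask_row in zip(mhi, motion_mask)
    ]


def motion_energy(mhi):
    """Binary image marking every pixel that moved within the last tau frames."""
    return [[1 if value > 0 else 0 for value in row] for row in mhi]
```

Recent motion ends up with the highest values in the history image, so a single 2D template encodes both where and how recently motion occurred.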
Another set of methods, focused on extracting spatio-temporal
information from action sequences, was based on
the analysis of the structure of local 3D (space-time) patches
in an action video ([26], [27], [28], [29]). A trend, finally, is
the blending of various local features (spatio-temporal or not)
with different combinations of learning techniques. Hidden
Markov Models (HMM) ([30], [31], [32]) and Conditional
Random Fields (CRF) ([33], [34], [35]) are such examples.
Support Vector Machines (SVM) based learning is also used
in a multitude of studies, such as in [36], [16], [37], as well as
the recent work in [38] that blends learning with a manifold
based representation of features.
The detection of human falls is closely related to
human action recognition and has attracted increasing interest lately.
However, it is usually addressed as a problem on its own,
with different approaches to its solution. Thus, a
separate subsection (I-B) is devoted to the related work.
B. Human fall detection related work
The need to automatically detect falls has mainly arisen
from the tendency of elderly people to live alone or spend a
lot of time unattended. Care for the elderly has traditionally
been the responsibility of family members and was provided
within a home environment. Increasingly in modern societies,
state or charitable institutions are also involved in the process.
Decreasing family size, the greater life expectancy of elderly
people, the geographical dispersion of families and changes in
work and education habits have contributed to this [39]. These
changes have affected European and North American countries
but are now increasingly affecting Asian countries as well [40].
Research is focused on the autonomy of elderly people
who tend to live alone or cannot afford
a caregiver. Falls are a major public
health issue among the elderly and, in this context, the number
of systems aimed at detecting them has increased dramatically
over recent years. According to the Center for Research and
Prevention of Injuries report, fall-related injuries among elderly
people are five times as frequent as other injuries, a fact that
considerably reduces an elderly person’s mobility and independence
[41]. According to the World Health Organization [42],
approximately 28-35% of people aged 65 and over fall each
year, increasing to 32-42% for those over 70 years of age.
The frequency of falls increases with age and frailty level. In
fact, falls increase exponentially with age-related biological
changes, which leads to a high incidence of falls and fall-related
injuries in aging societies.
Fall detection techniques can be divided into two
main categories: wearable sensor based and vision based
techniques. The first category is based on wearable devices
such as accelerometers and gyroscopes, or on smartphones
that contain this kind of sensor and are mostly carried continuously
by subjects. The second category is based on 2D or
3D cameras, involving image analysis and pattern recognition
techniques of high computational complexity. Methods in the
latter category have the advantage that a continually carried
device is not required. Of course, multiple modalities may
be combined to produce composite methods. A characteristic
example of a multimodal approach is given in [44]. In other
studies, researchers in [45] divide fall detectors into three main
categories: wearable device based, ambiance sensor based and
camera (vision) based, while, from a different perspective,
researchers in [46] make distinctions based on whether a
specific method measures acceleration or not.
Early attempts at providing a general overview of the state of fall
detection are presented in [47] and in [46]. However, as
technology in this area is advancing rapidly,
these reviews are mostly outdated. A newer comparative study
and more extensive literature review is provided in [43]. This
article aims to serve as a reference for both clinical and
biomedical engineers planning or conducting investigations
in the field. The authors mostly try to identify real-world
performance challenges and the current trends in the
field. A more detailed discussion is provided in [45] but lacks
references to new trends, such as smartphone-based techniques.
In the direction of vision-based solutions, such as the one
presented in this paper, researchers in [48] placed the camera
on the ceiling and analyzed the segmented silhouette and the
2D velocity of the subject. The determination of a fall is
achieved by empirical thresholding. The authors in [49], in
order to draw a distinction between falling and other fall-like
activities, such as sitting, added the extra information of
sound. However, a sound-based system cannot be very robust,
as most of the environments where such solutions are applied
are noisy. Another approach, presented in [50], is based on a
combination of motion history and human shape variation. To
cover large areas, wall cameras have been mounted and the
final decision is made by thresholding the extracted features.
In the technique documented in [51], the classification between
everyday activities and fall events is achieved by extracting
the eigen-motion and by applying multi-class Support Vector
Machines.
Other techniques, like the ones presented in [52] and [53],
use shape-based fall detectors which approximate the human
silhouette by a regular bounding box or an ellipse and extract
geometric attributes such as aspect ratio, orientation or edge
points. Approaches like this lack robustness and generality,
as they depend heavily on the accurate extraction
of the human silhouette and on the geometrical transformations
that may occur due to the distance and position of the subject
relative to the camera. In a more recent work in [55], the
method presented combines shape-based fall characterization
and a learning-based classifier, while in [56], human silhouette
is represented by ellipse fitting and motion is modeled by an
integrated normalized motion energy image. Shape deformation,
quantified from the fitted silhouettes is the basis for the
extracted features.
3D information provided by depth sensors, such as the
Microsoft Kinect, has been shown to help with partial
occlusion and viewpoint problems. Thus, a number of works
that leverage such information have been published.
In [57], a velocity based method is presented, that takes into
account the contraction or expansion of the width, height
and depth of a 3D bounding box. A priori knowledge of the
scene is not required as the set of captured 3D information
is adequate to complete the process of fall detection. Another
approach creates two feature parameters: the orientation of
the body and the height information of the spine, using either
image or world coordinates, based on captured Kinect data
[58]. The Kinect sensor is also used in [59], where the
proposed algorithm is based on the speed of the silhouette head
(previously detected), the body centroid and their distance
from the ground. Because it incorporates positions of both the
body centroid and the head, this technique is regarded to be
less affected by the centroid fluctuation. Finally, a statistical
method based on Kinect is proposed in [60]. The decision
is made based on information about how the human moved
during the last few frames. This method combines a set of
proposed features under a Bayesian framework. This study’s
main focus is to create a technique that, while it has been
trained on data captured from a specific viewpoint, is also
able to classify falls that have been captured from a different
viewpoint.
C. The proposed work: A preamble
The work introduced in this paper is inspired by the study
in [61]. In that paper we examined the potential of
the original Trace transform for human action recognition
and we proposed two novel feature extraction methods for the
particular task. The proposed techniques manage to produce
noise-robust features that proved to be sufficient for successful
recognition of human activity when tested on two popular
datasets.
However, both of the aforementioned techniques were based
on modeling actions in a per-frame fashion, not taking into
account any temporal interlinking between prominent features
in the action sequence. Although they show resilience
to occlusion, this may reduce their applicability on highly
occluded environments, where spatial information can be
distorted. Moreover, without any mechanism to cope with the
different lengths of action sequences, these techniques could
not accurately incorporate any information regarding rapid
position changes and velocity, which is vital in discriminating
between similar actions. For instance, in an unintentional
fall, we observe more abrupt position changes of the subject
than when performing a crouching action or lying down. The
methodologies presented in [61], lacking time sensitivity, do
not take this information into account.
In this paper, we define a new form of the Trace transform,
extending its capabilities to 3D space, and we propose a
novel feature extraction pipeline suitable for activity recognition
in videos. More specifically, we propose a cylindrical form
of the Trace transform which can be applied to 3D data
such as spatio-temporal sequences. A set of different Trace
transforms, using different functionals (as seen in subsequent
sections), can be calculated, capturing different properties of
the sequence. The method is combined with Selective Spatio-temporal
Interest Points (SSTIPs), proposed in [17], to form
the 3D mesh, to adapt even better to the temporal nature of
actions and to enhance the importance of discriminant spatio-temporal
features in the final representative vector.
The technique has been tested in two different scenarios:
human action recognition and fall detection in realistic environments.
The databases used were the KTH [36], Weizmann
[67] and THETIS [68] datasets for action recognition, while
for the fall detection scenario we used the UR Fall detection
[78], [81] and the Le2i Fall detection [41] datasets. The results
indicated high performance on all datasets,
demonstrating the potential of the proposed method.
The rest of the paper is organized as follows. The fundamental
theory behind the Trace transform and the 3D Radon
transform, which constitute the source of inspiration for the
proposed 3D Cylindrical Trace transform, is presented in
section II. The presentation and the notation for the proposed
transform are also found in the same section. The overview
of the proposed scheme and the feature extraction procedure
are described in section III. The experimental procedure and a
discussion on the results are provided in section IV, followed
by a short conclusion in section V.
The Trace transform is a generalization of the Radon [62]
transform, while the Radon transform is, in turn, a special case of it.
The Radon transform of an image is a 2D representation
of the image in coordinates $\phi$ and $p$, with the value of
the integral of the image computed along the corresponding
line placed at cell $(\phi, p)$. The Trace transform instead calculates a functional $T$ over
parameter $t$ along the line, which is not necessarily the integral.
The Trace transform is created by tracing an image with straight
lines along which certain functionals of the image function are
calculated. Different transforms having different properties can
be produced from the same image. The transform produced is
in fact a 2-dimensional function of the parameters of each
tracing line. The definition of these parameters for an image
tracing line is given in Figure 1. Examples of Radon and
Trace transforms for different action snapshots are given in
Figure 2. In the following, we provide a short description of the
Trace transform based on the theory provided in [63].
Fig. 1. Definition of the parameters of an image tracing line.
Fig. 2. Examples of Radon and Trace transforms created from the silhouettes
of different action snapshots taken from various datasets.
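As an illustration of the tracing procedure described above, the following is a minimal Python sketch of a discrete Trace transform. It is not the implementation used in this work: the function name `trace_transform`, the list-of-rows image format, the nearest-neighbour sampling and the choice of angles in [0, pi) are all assumptions made for the example. With the functional set to `sum`, each cell approximates the line integral, i.e. the Radon transform; passing another functional (for instance `max`) yields a different transform of the same image.

```python
import math


def trace_transform(image, functional, n_angles=4):
    """Discrete Trace transform of a 2D image given as a list of rows.

    Every tracing line is described by the angle phi of its normal and by
    its signed distance p from the image centre. `functional` reduces the
    image samples collected along the line (parameter t) to one number.
    Returns a list indexed as result[angle_index][offset_index].
    """
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # Half of the image diagonal bounds both p and t.
    radius = int(math.ceil(math.hypot(h, w) / 2.0))
    steps = range(-radius, radius + 1)

    result = []
    for a in range(n_angles):
        phi = math.pi * a / n_angles
        c, s = math.cos(phi), math.sin(phi)
        row = []
        for p in steps:
            samples = []
            for t in steps:
                # Point on the line: centre + p * normal + t * direction.
                x = int(round(cx + p * c - t * s))
                y = int(round(cy + p * s + t * c))
                if 0 <= x < w and 0 <= y < h:
                    samples.append(image[y][x])
            # Lines that miss the image contribute a zero cell.
            row.append(functional(samples) if samples else 0)
        result.append(row)
    return result
```

For example, `trace_transform(img, sum)` gives a Radon-like sinogram, while `trace_transform(img, max)` records the largest value met along each line.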
To better understand the transform, let us consider a
linearly distorted object (rotation, translation and scaling). We
can say that the object is simply perceived in another, linearly
distorted coordinate system. Let us call the initial coordinate
system of the image $C_1$
and the distorted one $C_2$. Let us also suppose that the
distorted system can be obtained by rotating $C_1$ by angle $-\theta$,
scaling the axes by parameter $\nu$ and translating by the
vector $(-s_0 \cos\psi_0, -s_0 \sin\psi_0)$. Suppose that there is a 2D
object $F$ which is viewed from $C_1$ as $F_1(x, y)$ and from $C_2$ as
$F_2(\tilde{x}, \tilde{y})$. $F_2(\tilde{x}, \tilde{y})$ can be considered as an image constructed
from $F_1(x, y)$ by rotation by $\theta$, scaling by $\nu^{-1}$ and shifting by
$(s_0 \cos\psi_0, s_0 \sin\psi_0)$. A linearly transformed image is actually
transferred along lines of another coordinate system, as the
straight lines in the new coordinate system also appear as
straight lines.
In [61], two different ways of using the Trace transform were
proposed for the extraction of features from human
action videos. Both methods (History Trace Templates (HTTs)
and History Triple Features (HTFs)) were able to create
representations of low dimensionality from an action sequence.
Both proved to be robust to noise and illumination variations.
However, they lacked time sensitivity and could not
effectively handle occlusion. This inspired the newly formulated
transform and the methodology presented in this study.
The proposed scheme has been designed for the scenario of
human action recognition and detection in video sequences.
It combines the proposed 3D Cylindrical Trace Transform (3D CTT) with a state-of-the-art
algorithm for spatio-temporal interest point acquisition, the
so-called Selective Spatio-Temporal Interest Points (SSTIPs)
[17]. A three-dimensional spatio-temporal volume is built
based on the SSTIPs mesh, and various 3D Cylindrical Trace
Transforms, using different functionals, are calculated from it. Finally,
the results are used in a triple-feature extraction scheme
that produces feature vectors of very low dimensionality. The
concept of this particular methodology is to take advantage of
the most valuable attributes each one of these techniques has
to offer and combine them in a final, straightforward scheme.
Spatio-temporal feature acquisition methods, such as STIPs
and Bag of Visual Words (BOVW), have recently become very popular in
the field of action recognition. However, this kind of representation
ignores potentially valuable information that refers
to the global spatio-temporal distribution of interest points
[65]. By introducing the Cylindrical Trace Transform, the
methodology presented in this paper manages to capture detailed
information about the geometrical distribution of interest
points, while at the same time it provides the versatility of
creating a large number of potential features for a variety of
capturing conditions, environments and applications. The use
and the combination of different and suitable functionals for
the calculation of different features can provide very robust
representations of an action video sequence, in the form of
feature vectors. More details on the individual techniques and
the proposed scheme are provided in the following subsections.
A. Selective Spatio-Temporal Interest Points
As mentioned above, the proposed scheme incorporates the
use of a novel approach to the STIPs acquisition problem,
presented in [17], the so-called Selective Spatio-Temporal
Interest Points technique. In this study, the authors proposed
a Spatio-Temporal Interest Points (STIPs) extraction methodology
which focuses on global motion instead of local spatiotemporal
information, thus preventing the erroneous detection
of interest points due to cluttered backgrounds and camera
motion. Furthermore, they show that their method performs
well in producing stable, repeatable STIPs, robust to the local
properties of the detector throughout the motion sequence.
One could summarize the selective STIPs pipeline as a
procedure that: 1. detects spatial interest points, 2. suppresses
unwanted background points and 3. imposes local and temporal
constraints on the result. The first step is essentially
conducted using a Harris corner detector. The underlying idea
behind the second step is the observation that corner points
detected in the background follow some particular geometric
pattern, while those on humans do not bear this property.
Finally, the spatial and temporal constraints are imposed, based
on the notion that for an interest point to be considered an
accurate and repeatable STIP, it should show a positional
change through the motion sequence. An example of extracted
SSTIPs from a sample of the THETIS dataset is given in
Figure 4.
Fig. 4. Selective STIPs extracted from a backhand shot video sequence from
the THETIS dataset. t denotes the direction of time.
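The first step of the pipeline summarized above, spatial interest point detection with a Harris corner detector, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `harris_response`, the 3x3 unweighted window, the central-difference gradients and the value k = 0.05 are choices made for the example, and the background suppression and spatio-temporal constraints of [17] are not reproduced here.

```python
def harris_response(image, k=0.05):
    """Harris corner response for a grayscale image given as a list of rows.

    Large positive values indicate corners, negative values indicate
    edges and values near zero indicate flat regions.
    """
    h, w = len(image), len(image[0])

    def px(y, x):
        # Replicate border pixels so gradients are defined everywhere.
        return image[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    # Central-difference image gradients along x and y.
    ix = [[(px(y, x + 1) - px(y, x - 1)) / 2.0 for x in range(w)]
          for y in range(h)]
    iy = [[(px(y + 1, x) - px(y - 1, x)) / 2.0 for x in range(w)]
          for y in range(h)]

    response = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sxx = syy = sxy = 0.0
            # Accumulate the structure tensor over a 3x3 neighbourhood.
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        gx, gy = ix[yy][xx], iy[yy][xx]
                        sxx += gx * gx
                        syy += gy * gy
                        sxy += gx * gy
            det = sxx * syy - sxy * sxy
            trace = sxx + syy
            response[y][x] = det - k * trace * trace
    return response
```

Candidate interest points would then be the pixels whose response exceeds a chosen threshold; the later steps decide which of those candidates survive.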
At this point, we will document the experimental procedures
we followed in order to indicate the efficacy of the proposed
technique on the tasks of human action recognition and fall
detection. We will provide the experimental results for a
series of different known and challenging datasets and we will
demonstrate how the algorithm performs under different video
types and video capturing scenarios.
Before we describe the experimental
protocols used, it is worth mentioning that, according
to [66], a variety of different experimental scenarios
is used for the same datasets among researchers working on
action recognition from videos. It is also reported that methods
evaluated on known datasets such as KTH [36] and Weizmann
[67] may present result variations of up to 10.67% when different
validation approaches are applied. However, there is still no
unified validation standard.
Fig. 6. Triple feature extraction from a spatio-temporal volume.
In the following experiments, where applicable (e.g.
action recognition datasets), the leave-one-person-out cross-validation
protocol was used for the evaluation of performance.
This protocol most closely reflects the needs of real-life
applications. In a hypothetical real-world scenario, the
physical dynamic behavior of an unknown subject is captured
by an action recognition system and then processed
and compared against the pre-recorded set of data that was
previously used to train it. The final decision is made based on
the similarity of the examined action to one of the samples
that comprise the training set, according to the system’s set rule.
Accordingly, the above protocol uses one person’s samples
for testing and the rest of the dataset for training.
The procedure is repeated $N$ times, where $N$ is the number
of subjects within the dataset. Performance is reported as the
average accuracy over $I = \sum_{n=1}^{N} H_n$ iterations, where $H_n$ is
the number of samples for the $n$th subject within the dataset.
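The leave-one-person-out protocol described above can be sketched as follows. The code is a minimal illustration: the function names and the (subject, features, label) tuple layout are assumptions, and the nearest-centroid classifier is only a stand-in used to make the example runnable, not the classifier used in this work.

```python
def leave_one_person_out(samples, train_fn, predict_fn):
    """Average accuracy under leave-one-person-out cross-validation.

    samples    -- list of (subject_id, feature_vector, label) tuples
    train_fn   -- builds a model from a list of (feature_vector, label)
    predict_fn -- returns the predicted label for (model, feature_vector)
    """
    subjects = sorted({subject for subject, _, _ in samples})
    correct = total = 0
    for held_out in subjects:
        # One person's samples are tested, all others are used to train.
        train = [(x, y) for s, x, y in samples if s != held_out]
        test = [(x, y) for s, x, y in samples if s == held_out]
        model = train_fn(train)
        for x, y in test:
            correct += int(predict_fn(model, x) == y)
            total += 1
    # total equals the sum of the per-subject sample counts.
    return correct / total


def train_centroids(train):
    """Stand-in classifier: store the mean feature vector of every class."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Assign x to the class with the closest stored mean vector."""
    return min(model, key=lambda y: sum((a - b) ** 2
                                        for a, b in zip(model[y], x)))
```

Because every fold holds out all samples of one subject, no person ever appears in both the training and the test set, which is what distinguishes this protocol from a plain leave-one-sample-out split.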
Although fall detection can be classified as a human action
recognition task, a different approach is required, as the results
are mostly calculated on a yes or no basis. A potential
system continually monitors a subject whose physical dynamic
behavior is captured. The captured behavior is analyzed at
regular intervals and compared against a pre-recorded set of
possible falls that was previously used to train the system. The final
decision is made based on the similarity of the examined
behavior to one of the different fall samples, and a fall
or a non-fall situation is reported. The protocol used for this
particular task uses one sample for testing and the rest of the
set for training. The decision made is binary (0 or 1)
and the procedure is repeated I times, where I is the number of samples in
the dataset. Performance is reported as the ratio of successful
classifications over I tests.
A. Action recognition experiments
For the experiments, three different datasets have been
used: the KTH, the Weizmann and the THETIS [68] action
databases. Figures 7 and 8 illustrate various samples for the
different types of actions contained in the first two datasets.
The KTH video database contains six types of human actions
(walking, jogging, running, boxing, hand waving and hand
clapping) performed several times by 25 subjects in four
different scenarios, under different illumination conditions:
outdoors, outdoors with scale variation (camera zoom in
and out), outdoors with different clothes and indoors. The
database contains 600 sequences. All sequences were taken
over homogeneous backgrounds with a static camera at a
25 fps frame rate.
The Weizmann video database consists of 90 low-resolution
(180 x 144, deinterlaced 50 fps) video sequences presenting
nine different people. Each individual has performed 10 natural
actions: run, walk, skip, jumping-jack (or jack, for
short), jump-forward-on-two-legs (or jump), jump-in-place-on-two-legs
(or pjump), gallop-sideways (or side), wave-two-hands
(or wave2), wave-one-hand (or wave1) and bend.
Fig. 7. Action samples from Weizmann database for wave1, wave2, walk,
pjump, side, run, skip, jack, jump and bend.
Fig. 8. Action samples from KTH database for walking, jogging, running,
boxing, hand waving and hand clapping respectively.
The THETIS set comprises 12 basic tennis shots
performed by 31 amateurs and 24 experienced players. All
videos have been captured using the Kinect sensor, which was
placed in front of the subjects. Each shot has been performed
several times, resulting in 8734 (single period cropped)
videos, converted to AVI format. The total duration of the
videos is 7 hours and 15 minutes. The shots performed are
the following: backhand with two hands, backhand, backhand
slice (or bslice), backhand volley (or bvolley), forehand flat
(or foreflat), forehand open stands (or foreopen), forehand slice
(or fslice), forehand volley (or fvolley), service flat (or serflat),
service kick (or serkick), service slice (or serslice) and smash.
Samples of the THETIS dataset are illustrated in Figure 9.
Fig. 9. Action samples from the THETIS database for the backhand, flat
service, forehand flat, slice service and smash moves. Top row: RGB samples,
middle row: depth samples, bottom row: 3D skeleton samples.
In our experiments, the sequences of the KTH and the
Weizmann sets, which have a length of four seconds on
average, have been scaled up to a spatial resolution of
240 x 180 pixels, to improve the quality and quantity of the
acquired Selective STIPs. No background removal procedure
was used, as Selective STIPs were extracted directly from
the input grayscale frames. The leave-one-person-out cross-validation
approach was used to test the performance of the
proposed algorithm in a more generalized way.
The THETIS dataset has been scaled down (for speed
reasons) to 320 x 240 from the initial resolution of 640 x 480.
In our experiments, three different data types of the dataset
have been utilized: RGB, Depth and 3D skeletons (Skelet3D).
In this case, neither time alignment nor background removal
takes place for any of the three types of videos. In
the same manner as with the KTH and Weizmann datasets,
Selective STIPs were calculated on the RGB (converted to grayscale)
and depth (essentially grayscale) frames directly, without
any background removal process taking place. Skelet3D
videos contain the space-aligned skeletal representation of the
subject; thus, for this type of video,
background information is not included by default. The
leave-one-person-out cross-validation protocol is also used here.
After the videos were scaled to the desired resolution, the
complete proposed pipeline, described in III-C, used them as
input in order to create feature vectors that were subsequently
fed to a series (as many as the number of classes in each
case) of Support Vector Machines (SVMs). To examine
the variations in the results produced by using different values
for the plane rotation angle step, the pipeline was tested with step
values of 9° and 6°. One can intuitively determine that the smaller
this step is, the closer the result gets to the continuous form of the
proposed transform. This may offer more robust features,
at the cost of time efficiency. Finally, PCA is performed
on the produced vectors, in order to keep the subset of the most
discriminant features and reduce dimensionality.
Feature vectors were then used to train a number of SVMs,
as dictated by the protocol followed. The Gaussian Radial
Basis Function kernel was used for the mapping of the
training data into kernel space. At this point, we should note
that human action recognition is a multi-class classification
problem. We cope with this by formulating the problem as
a generalization of binary classification. More specifically, for
each dataset, we trained 6, 10 and 12 different SVMs (one
for each class of the KTH, the Weizmann and the THETIS
database respectively) using a one-against-all protocol. The
final decision was made by assigning each testing sample to a
class Ca, according to the distance d of the testing vector from
the support vectors, where Ca is the set of templates assigned
to an action class (e.g. boxing). However, since we wanted to
evaluate the generalization of the algorithm in a broader
way, we measured the successful binary classifications of every
sample, tested on each of the different trained SVMs. The
results of this experimental procedure can be found in IV-C.
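To make the one-against-all scheme more concrete, the sketch below evaluates the standard kernel SVM decision function with a Gaussian RBF kernel for each class and picks the class with the largest value. It is a hedged illustration: training is omitted, the support vectors, dual coefficients, bias terms and `gamma` are hypothetical inputs assumed to come from already trained binary SVMs, and selecting the maximum decision value is one common one-against-all rule that may differ in detail from the distance-based rule described above.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian RBF kernel: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, support_vectors, dual_coefs, bias, gamma):
    """Signed score of one binary SVM: sum_i coef_i * K(sv_i, x) + bias."""
    return sum(coef * rbf_kernel(sv, x, gamma)
               for sv, coef in zip(support_vectors, dual_coefs)) + bias


def predict_one_vs_all(x, classifiers, gamma):
    """Pick the class whose binary SVM gives the largest decision value.

    classifiers -- dict mapping a class label to a tuple
                   (support_vectors, dual_coefs, bias) of a trained SVM
    Returns the winning label and the per-class scores.
    """
    scores = {
        label: decision_value(x, svs, coefs, bias, gamma)
        for label, (svs, coefs, bias) in classifiers.items()
    }
    return max(scores, key=scores.get), scores
```

The per-class scores returned alongside the label also allow every binary classifier to be evaluated separately, by checking the sign of each score against the true class membership.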
B. Fall detection experiments
In this section we will present the experimental results and
the procedure followed for the evaluation of the proposed
technique when tested on the scenario of human fall detection.
In general, there are only a few available datasets dedicated to
fall detection, as most of the published techniques have been
tested on the respective authors' own datasets. However, in
order to have a benchmark, we have evaluated our technique
on two new publicly available datasets: the UR Fall Detection
[78], [81] and the Le2i Fall Detection datasets [41].
The UR Fall dataset contains 60 sequences recorded with
2 Microsoft Kinect cameras and corresponding accelerometric
data. Sensor data was collected using PS Move (60 Hz) and x-IMU
(256 Hz) devices. The dataset contains sequences of depth
and RGB images for two differently mounted cameras (parallel
to the floor and ceiling mounted, respectively), synchronization
data, and raw accelerometer data. Each video stream, both
RGB and depth, is stored in separate folders in the form of
png image sequences. From this dataset we have used
the depth data provided by the ceiling mounted camera as well
as the frontal RGB videos, following the experimental protocol
given by the authors in [78] and [81]. Frame samples taken from
the UR Fall dataset are provided in Figure 10.
Experimentation on the UR fall detection dataset was divided
into two phases. The first one aimed to evaluate the
ceiling mounted depth camera scenario, corresponding to the
methodology presented in [78]. More specifically, a set of
60 cropped motion sequences from the ceiling depth video
subset were used. These motion sequences contained both
unintentional falls, such as tripping and falling from a chair,
and other everyday activities, such as walking. Artificial, fall-like
activities were added to the dataset to make the problem
more challenging. Motion sequences of persons almost falling
from chairs were hand-crafted and added to the dataset.
Background segmentation, noise reduction and thresholding
techniques were used to extract binary and depth silhouettes.
Spatio-temporal points were calculated on these silhouettes.
In the second phase, we experimented with the (newly
added at the time) frontal RGB videos, similarly to the
experiments conducted in [81]. This was an attempt to fully
evaluate the capabilities of the proposed method in environments
where background segmentation is a non-trivial,
error-prone procedure. For this reason, no human silhouette
segmentation was performed and the experiments relied only
on the spatio-temporal information from the subject's motion
inside the video. This part of the dataset was used uncropped,
i.e. with each video containing a full set of human actions such
as entering a room, walking inside and then performing an
intentional or unintentional fall. Activities closely resembling
falls (not hand-crafted) were added to this set, such as crouching
under a sofa, lying on a bed, bending to tie shoelaces, etc.
Fig. 10. Frame samples taken from the UR fall dataset for two falls.
The upper row illustrates the RGB samples while the lower row
provides the corresponding depth images.
Fig. 11. Frame samples taken from the Le2i fall dataset. The upper
row illustrates samples from daily actions in "Coffee room" while
the lower row provides samples from a fall that occurred in "Home".
The Le2i Fall dataset has been captured in realistic video
surveillance settings using a single RGB camera. The frame
rate is 25 frames/s and the resolution is 320×240 pixels. The
video data illustrate the main difficulties of realistic video
sequences that can be found in a home environment for the elderly,
as well as in a simple office room. The video sequences contain
variable illumination and typical difficulties such as occlusions
or cluttered and textured backgrounds. The actors performed
various normal daily activities and falls. The dataset contains
130 annotated videos, with extra information representing the
ground-truth of the fall position in the image sequence. The
database provides different locations for testing and training,
while the authors in [41] have defined several protocols for the
evaluation of their method. Working with this dataset,
we have followed the protocol P1 given in the above paper,
where training and test sets are built with videos from the "Home"
and "Coffee room" subsets. Samples from the Le2i dataset are
provided in Figure 11.
In both the first experimental scenario on the UR dataset
and the Le2i dataset, extraction of feature vectors using the
proposed scheme was preceded by human silhouette
extraction. In the UR dataset case, this was handled by computing
differences between depth pixels in a particular frame
and their corresponding pixels in a precalculated reference
frame. The reference frame was calculated by computing the
median value of every depth pixel in a sliding window of 9
frames, over a total of 80 frames portraying a scene lacking
human presence. Then, for every pixel, the mean of these median
values was calculated, forming the final reference frame and
eliminating a considerable amount of noise generated by the
depth camera.
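The reference frame computation can be sketched in a few lines.
This is a dependency-free illustration under stated assumptions:
frames are small nested lists of depth values, the window
advances one frame at a time (the stride is not given in the
text), and the function name reference_frame is hypothetical.

```python
from statistics import mean, median


def reference_frame(frames, window=9):
    """Build a background reference from frames with no person in view.

    frames: list of equally sized 2D lists of depth values (mm).
    For each pixel, take the median inside every sliding window of
    `window` consecutive frames (stride 1, an assumption), then
    average those medians.
    """
    if len(frames) < window:
        raise ValueError("need at least `window` frames")
    rows, cols = len(frames[0]), len(frames[0][0])
    n_windows = len(frames) - window + 1
    ref = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            series = [f[r][c] for f in frames]
            medians = [median(series[s:s + window])
                       for s in range(n_windows)]
            ref[r][c] = mean(medians)
    return ref


# 80 empty-scene frames of a 1x2 image, with one invalid reading (0).
frames = [[[3000, 2500]] for _ in range(80)]
frames[40][0][0] = 0
print(reference_frame(frames))
```

The single invalid reading never reaches the result, because
every 9-frame window that contains it still has a median equal
to the true background depth. On full-size depth maps an array
library would be the practical choice; plain lists are used here
only to keep the sketch self-contained.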
Human presence in a particular frame can be associated
with the cases where the difference between depth pixel
values of that frame and the reference exceeds a predefined
threshold. In the case of the UR dataset, a total of four
thresholds were used, to add robustness. The first two were
used to filter out noisy and invalid pixels. For a pixel value to
be valid (i.e. possibly part of a human silhouette), it was
required to be between 1100 and 3620 millimeters. Recall
that this represents distance from a ceiling mounted
depth camera, such as Kinect, whose depth map values are
measured in millimeters, in the [800, 4000] range. Afterwards,
to indicate human movement, the difference between a pixel
of the current frame and a reference pixel was required to be
between 50 and 2200 millimeters. These values were found to
offer maximum tolerance against random noise.
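The four thresholds translate into a simple per-pixel test,
sketched below. The numeric limits come from the text; treating
them as inclusive bounds and using the absolute difference from
the reference are assumptions, and the function names are
hypothetical.

```python
VALID_MM = (1100, 3620)   # depth range of a plausible silhouette pixel
MOTION_MM = (50, 2200)    # required difference from the reference


def is_foreground(depth_mm, reference_mm):
    """Apply the four thresholds to a single depth pixel."""
    # Thresholds 1 and 2: reject noisy or invalid depth readings.
    if not VALID_MM[0] <= depth_mm <= VALID_MM[1]:
        return False
    # Thresholds 3 and 4: the change against the reference must be
    # large enough to be motion, but not implausibly large.
    diff = abs(reference_mm - depth_mm)
    return MOTION_MM[0] <= diff <= MOTION_MM[1]


def silhouette_mask(frame, reference):
    """Return a 2D list of booleans marking silhouette pixels."""
    return [[is_foreground(d, r) for d, r in zip(frow, rrow)]
            for frow, rrow in zip(frame, reference)]


reference = [[3500, 3500], [3500, 3500]]   # empty floor, 3.5 m away
frame = [[1900, 3480], [0, 1000]]          # person, noise, invalid, too close
print(silhouette_mask(frame, reference))
```

Only the first pixel survives: the second differs from the
reference by just 20 mm, while the other two fall outside the
valid depth range.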
Given that it consists of RGB video files of low
resolution, the Le2i dataset was handled in a different way.
Furthermore, lighting conditions in most cases (especially
in the "Home" subset) rendered the use of the difference
between frames unreliable. In order to segment the human
silhouette, the background-foreground segmentation approach
proposed by Zivkovic in [82] and [83] was utilized. In this
technique, a subtraction between the current frame and a
background model is performed. This model is constantly
updated in a per-pixel fashion, using a Gaussian mixture-based
approach, to contain what is considered the static part of the
scene, adapting to scene changes in the video sequences.
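The idea of a per-pixel, continuously updated background model
can be illustrated with a deliberately simplified stand-in. The
sketch below keeps a single running Gaussian per pixel; it is
not the method of [82] and [83], which maintains a mixture of
Gaussians per pixel. The class name and all parameter values
(alpha, k, init_var, min_var) are assumptions chosen for the
example.

```python
class RunningGaussianPixel:
    """Single-Gaussian background model for one grayscale pixel."""

    def __init__(self, first_value, alpha=0.05, k=2.5,
                 init_var=225.0, min_var=4.0):
        self.mean = float(first_value)
        self.var = init_var
        self.alpha = alpha      # learning rate (assumed value)
        self.k = k              # foreground cut-off in std deviations
        self.min_var = min_var  # floor that keeps the model from freezing

    def apply(self, value):
        """Classify `value`, then update the model with it."""
        diff = value - self.mean
        foreground = diff * diff > (self.k ** 2) * self.var
        # Update on every sample so lasting changes are absorbed.
        self.mean += self.alpha * diff
        self.var += self.alpha * (diff * diff - self.var)
        self.var = max(self.var, self.min_var)
        return foreground


pixel = RunningGaussianPixel(100)
print([pixel.apply(v) for v in (100, 101, 99, 100)])  # stable background
print(pixel.apply(200))                                # sudden change
```

Small fluctuations around the learned mean are treated as
background, while a sudden jump is flagged as foreground. If the
new value persists, the running mean and variance drift toward
it and the pixel is eventually absorbed into the background,
which mirrors the adaptation to scene changes described above.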
In our experiments, sequences of both datasets were
scaled down to a spatial resolution of 320×240 pixels and
have a temporal length of 26 and 12 frames on average for the
UR and the Le2i dataset respectively. For the ceiling mounted
camera scenario on the UR dataset and the Le2i dataset,
training and testing samples were constructed by manually
cropping the motion sequences to contain only the fall or fall-like
activity part. The feature vector extraction pipeline used
in all fall detection scenarios is the same as the one documented
previously, in the action recognition experiments.
At this point, we should mention that there is no unified
standard to follow for the evaluation of fall detection algorithms.
In the experiments conducted in this study, a simple
leave-one-out protocol was used to evaluate performance.
Contrary to the procedure followed in the action recognition
experiments, the lack of different activity samples performed by
distinguishable persons led to adapting the original
leave-one-person-out protocol to a simplistic leave-one-sample-out
protocol, as mentioned earlier. In every iteration, a different
activity sample, regardless of whether it depicts a fall or not,
is used for testing a system that has previously been trained using
the rest of the dataset. The results of this experimental procedure
can be found in the next subsection (IV-C).
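The leave-one-sample-out loop can be sketched as follows. The
train and predict callables are placeholders for the real
feature pipeline and classifier, and the toy nearest-mean
classifier and the data are hypothetical, included only so the
sketch runs end to end.

```python
def leave_one_sample_out(samples, labels, train, predict):
    """Return the fraction of held-out samples classified correctly.

    train(samples, labels) -> model and predict(model, sample) -> label
    are caller-supplied stand-ins for the actual system.
    """
    if len(samples) != len(labels) or len(samples) < 2:
        raise ValueError("need at least two labelled samples")
    correct = 0
    for i in range(len(samples)):
        # Train on everything except sample i, then test on sample i.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        correct += predict(model, samples[i]) == labels[i]
    return correct / len(samples)


def train_nearest_mean(xs, ys):
    """Toy classifier: remember the mean feature value of each class."""
    means = {}
    for label in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(vals) / len(vals)
    return means


def predict_nearest_mean(means, x):
    return min(means, key=lambda label: abs(means[label] - x))


xs = [0.9, 1.1, 1.0, 5.0, 5.2, 4.8]
ys = ["fall", "fall", "fall", "other", "other", "other"]
print(leave_one_sample_out(xs, ys, train_nearest_mean, predict_nearest_mean))
```

Each sample is tested exactly once by a model that has never
seen it, so the returned value is the overall accuracy over as
many train/test rounds as there are samples.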
C. Results
Comparative results on the action recognition task can be
found in Table II. In addition, Tables III, IV, V, VI and
VII show the confusion matrices generated by the presented
method on the examined action datasets. Rows depict the
accuracy achieved (correct answers/all answers) by all trained
SVMs on a certain action class. Columns, on the other hand,
show the performance of individual, class-specific SVMs on
all action classes. Finally, experimental results on fall detection
can be found in Table VIII.
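One way to read the layout of these matrices is sketched below,
under the assumption that cell (i, j) holds the share of test
samples of true class i that the SVM for class j judged
correctly in the binary sense. The function name and the toy
votes are hypothetical.

```python
def binary_accuracy_matrix(true_labels, svm_votes, classes):
    """Cell [i][j]: share of class-i samples that SVM j judged correctly.

    svm_votes[n][c] is True when the SVM for class c accepted sample n.
    A vote counts as correct when it is True for the sample's own
    class and False for any other class.
    """
    matrix = []
    for row_class in classes:
        idx = [n for n, y in enumerate(true_labels) if y == row_class]
        row = []
        for col_class in classes:
            hits = sum(svm_votes[n][col_class] == (row_class == col_class)
                       for n in idx)
            row.append(hits / len(idx) if idx else float("nan"))
        matrix.append(row)
    return matrix


classes = ["jump", "run"]
true_labels = ["jump", "jump", "run", "run"]
svm_votes = [
    {"jump": True, "run": False},   # both SVMs correct
    {"jump": False, "run": False},  # "jump" SVM misses its own class
    {"jump": False, "run": True},   # both SVMs correct
    {"jump": True, "run": True},    # "jump" SVM wrongly accepts a run
]
print(binary_accuracy_matrix(true_labels, svm_votes, classes))
```

Reading along a row shows how all SVMs behaved on one action
class, and reading down a column shows how one class-specific
SVM behaved across all classes.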
As seen in Table II, the feature extraction pipeline based on
3D CTT and the Selective STIPs achieves high accuracy
on all examined datasets and is on a par with or even outperforms
other published methods in the field. Particularly on the
KTH database, the presented technique achieves
99.98% accuracy, which is validated by the corresponding
confusion matrix (Table III). In a notable comparison, the
pipeline proposed by Yuan et al. in [65], which is based on features
extracted using a formulation of the 3D Radon transform
and a combinatorial STIP + BoVW method, achieves 95.49%
accuracy on the same dataset. Results on the Weizmann dataset,
where an accuracy of 96.34% was achieved, suggest the existence of a
slight confusion of the class-specific classifiers, especially of
the ones dedicated to the "jump" action.

