Application areas:
Digital watermarking is the imperceptible, robust and secure communication of data related to a host signal, comprising embedding into and extraction from that signal. The basic goal is that the embedded watermark information follows the watermarked multimedia and endures both unintentional modifications and intentional removal attempts. The principal design challenge is to embed the watermark so that it is reliably detected by a watermark detector. The relative importance of these properties depends significantly on the application for which the algorithm is designed.
For copy protection applications, the watermark must be recoverable even when the watermarked signal undergoes a considerable level of distortion, while for tamper assessment applications, the watermark must effectively characterize the modification that took place. In this section, several application areas for digital watermarking will be presented and advantages of digital watermarking over standard technologies examined.
Ownership Protection
In ownership protection applications, a watermark containing ownership information is embedded into the multimedia host signal. The watermark, known only to the copyright holder, is expected to be very robust and secure (i.e., to survive common signal processing modifications and intentional attacks), enabling the owner to demonstrate its presence, and thus his ownership, in case of dispute. Watermark detection must have a very small false alarm probability. On the other hand, ownership protection applications require only a small embedding capacity, because the number of bits that must be embedded and extracted with a small probability of error is modest.
Proof of ownership
It is even more demanding to use watermarks not only for identification of copyright ownership, but as an actual proof of ownership. The problem arises when an adversary uses editing software to replace the original copyright notice with his own and then claims to own the copyright himself. In early watermark systems, the problem was that the watermark detector was readily available to adversaries, and anybody who can detect a watermark can probably remove it as well. Therefore, because an adversary could easily obtain a detector, he could remove the owner's watermark and replace it with his own. To achieve the level of security necessary for proof of ownership, it is indispensable to restrict the availability of the detector. When an adversary does not have the detector, removal of a watermark can be made extremely difficult. However, even if the owner's watermark cannot be removed, an adversary might try to undermine the owner: using his own watermarking system, he might be able to make it appear as if his watermark data were present in the owner's original host signal. This problem can be solved by a slight alteration of the problem statement. Instead of a direct proof of ownership, e.g. by embedding a "Dave owns this image" watermark signature in the host image, the algorithm instead tries to prove that the adversary's image is derived from the original watermarked image. Such an algorithm provides indirect evidence that it is more probable that the real owner owns the disputed image, because he is the one who holds the version from which the other two were created.
Authentication and tampering detection
In content authentication applications, a set of secondary data is embedded in the host multimedia signal and is later used to determine whether the host signal was tampered with. Robustness against removing the watermark or making it undetectable is not a concern, as the attacker has no motivation to do so. However, forging a valid authentication watermark in an unauthorized or tampered host signal must be prevented. In practical applications it is also desirable to locate the tampering (in the time or spatial dimension) and to discriminate unintentional modifications (e.g. distortions incurred due to moderate MPEG compression) from content tampering itself. In general, the watermark embedding capacity has to be high, to satisfy the need for more additional data than in ownership protection applications. Detection must be performed without the original host signal, because either the original is unavailable or its integrity has yet to be established. This kind of watermark detection is usually called blind detection.
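As an illustration of the blind, fragile authentication idea (a toy sketch of the concept, not the scheme used in this work), the following stores a hash of the content in the sample LSBs; any change to the samples invalidates the stored hash. Sample values are assumed to be small non-negative integers.

```python
import hashlib

def embed_auth_watermark(samples):
    """Fragile watermark: clear each sample's LSB, hash the cleared
    content, then store the hash bits in the LSBs (illustrative only)."""
    cleared = [s & ~1 for s in samples]
    digest = hashlib.sha256(bytes((s >> 1) & 0xFF for s in cleared)).digest()
    bits = [(digest[(i // 8) % len(digest)] >> (i % 8)) & 1
            for i in range(len(samples))]
    return [c | b for c, b in zip(cleared, bits)]

def verify_auth_watermark(samples):
    """Blind detection: recompute the hash from the LSB-cleared signal
    and compare it with the embedded LSB bits -- no original needed."""
    cleared = [s & ~1 for s in samples]
    digest = hashlib.sha256(bytes((s >> 1) & 0xFF for s in cleared)).digest()
    bits = [(digest[(i // 8) % len(digest)] >> (i % 8)) & 1
            for i in range(len(samples))]
    return all((s & 1) == b for s, b in zip(samples, bits))
```

Note that verification needs only the received signal itself, which is exactly the blind-detection property described above.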
Broadcast monitoring
Watermarking is an obvious alternative method of coding identification information for active broadcast monitoring. It has the advantage of being embedded within the multimedia host signal itself rather than exploiting a particular segment of the broadcast signal. Thus, it is compatible with the already installed base of broadcast equipment, including digital and analogue communication channels. The primary drawback is that the embedding process is more complex than simply placing data into file headers. There is also a concern, especially on the part of content creators, that the watermark would introduce distortions and degrade the visual or audio quality of the multimedia. A number of broadcast monitoring watermark-based applications are already available on a commercial basis. These include program type identification, advertising research, broadcast coverage research, etc. Users receive detailed proof-of-performance information that allows them to:
1. Verify that the correct program and its associated promos aired as contracted;
2. Track barter advertising within programming;
3. Automatically track multimedia within programs using online software.
Information carrier
The embedded watermark in this application is expected to have a high capacity and to be detected and decoded using a blind detection algorithm. While robustness against intentional attacks is not required, a certain degree of robustness against common processing like MPEG compression may be desired. A public watermark embedded into the host multimedia might be used as a link to external databases that contain additional information about the multimedia file itself, such as copyright information and licensing conditions. One interesting application is the transmission of metadata along with the multimedia. Metadata embedded in, e.g., an audio clip may carry information about the composer, soloist, genre of music, etc.
Watermark bit rate
The bit rate of the embedded watermark is the number of embedded bits within a unit of time, usually given in bits per second (bps). Some audio watermarking applications, such as copy control, require the insertion of a serial number or author ID, with an average bit rate of up to 0.5 bps. For broadcast monitoring, the bit rate is higher, because the ID signature of a commercial must be embedded within the first second of the broadcast clip, giving an average bit rate of up to 15 bps. In some envisioned applications, e.g. hiding speech in audio or a compressed audio stream in audio, algorithms have to be able to embed watermarks with a bit rate that is a significant fraction of the host audio bit rate, up to 150 kbps.
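The payload figures above follow from simple arithmetic between bit rate and clip duration; a tiny helper (names are illustrative) makes the trade-off explicit:

```python
def payload_bits(bit_rate_bps, duration_s):
    """Watermark bits that fit in a clip of the given duration."""
    return int(bit_rate_bps * duration_s)

def min_duration_s(n_bits, bit_rate_bps):
    """Minimum clip length needed to carry n_bits at the given rate."""
    return n_bits / bit_rate_bps

# At the copy-control rate of 0.5 bps, a 64-bit serial number
# needs 128 seconds of audio; broadcast monitoring at 15 bps
# fits a 15-bit ID into the first second of a clip.
```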
Robustness
The robustness of the algorithm is defined as the ability of the watermark detector to extract the embedded watermark after common signal processing manipulations. A detailed overview of robustness tests is given in Chapter 3. Applications usually require robustness to a predefined set of signal processing modifications, so that the watermark can be reliably extracted at the detection side. For example, in radio broadcast monitoring, the embedded watermark needs only to survive distortions caused by the transmission process, including dynamic compression and low pass filtering, because watermark detection is done directly from the broadcast signal. On the other hand, in some algorithms robustness is completely undesirable; those algorithms are labeled fragile audio watermarking algorithms.
Security
A watermark algorithm must be secure in the sense that an adversary must not be able to detect the presence of embedded data, let alone remove it. The security of the watermarking process is interpreted in the same way as the security of encryption techniques: it cannot be broken unless the authorized user has access to a secret key that controls watermark embedding. An unauthorized user should be unable to extract the data in a reasonable amount of time even if he knows that the host signal contains a watermark and is familiar with the exact watermark embedding algorithm. Security requirements vary with the application; the most stringent arise in covert communications applications, and in some cases the data is encrypted prior to embedding into the host audio.
THEORY
The fundamental process in each watermarking system can be modeled as a form of communication in which a message is transmitted from the watermark embedder to the watermark receiver. The process of watermarking is viewed as a transmission channel through which the watermark message is sent, with the host signal being part of that channel. In Figure 2, a general mapping of a watermarking system onto a communications model is given. After the watermark is embedded, the watermarked work is usually distorted by watermark attacks. The distortions of the watermarked signal are, similarly to the data communications model, modeled as additive noise.
Fig 2: Basic Watermarking system equivalent to a communication system
In this project, signal processing methods are used for watermark embedding and extracting processes, derivation of perceptual thresholds, transforms of signals to different signal domains (e.g. Fourier domain, wavelet domain), filtering and spectral analysis. Communication principles and models are used for channel noise modeling, different ways of signaling the watermark (e.g. a direct sequence spread spectrum method, frequency hopping method), and derivation of optimized detection method (e.g. matched filtering) and evaluation of overall detection performance of the algorithm (bit error rate, normalized correlation value at detection). The basic information theory principles are used for the calculation of the perceptual entropy of an audio sequence, channel capacity limits of a watermark channel and during design of an optimal channel coding method.
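To make the communications view concrete, the toy sketch below embeds one bit by direct-sequence spread spectrum and recovers it by correlation with the key-regenerated chip sequence (matched-filter style detection). The key, amplitude and signal lengths are illustrative assumptions, not parameters of the algorithm developed in this work.

```python
import random

def ss_embed(host, message_bit, key, alpha=0.1):
    """Add a key-derived +/-1 chip sequence, sign-modulated by the
    message bit, at low amplitude alpha (the 'additive noise' the
    host signal sees from the watermark's point of view)."""
    rng = random.Random(key)
    chips = [rng.choice((-1.0, 1.0)) for _ in host]
    sign = 1.0 if message_bit else -1.0
    return [x + alpha * sign * c for x, c in zip(host, chips)]

def ss_detect(received, key):
    """Regenerate the chips from the key and correlate; the sign of
    the normalized correlation recovers the embedded bit."""
    rng = random.Random(key)
    chips = [rng.choice((-1.0, 1.0)) for _ in received]
    corr = sum(r * c for r, c in zip(received, chips)) / len(received)
    return corr > 0.0
```

The host signal itself acts as channel noise at the detector, which is why a long chip sequence (large processing gain) is needed for a reliable decision.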
During transmission and reception, signals are often corrupted by noise, which can cause severe problems for downstream processing and user perception. It is well known that to cancel the noise component present in the received signal using adaptive signal processing techniques, a reference signal is needed that is highly correlated with the noise. Since the noise is added in the channel and is totally random, there is no means of creating such a correlated reference at the receiving end; the only possibility is to somehow extract the noise from the received signal itself, as only the received signal carries information about the noise added to it. Therefore, an automated means of removing the noise would be an invaluable first stage for many signal-processing tasks. Denoising has long been a focus of research, and yet there always remains room for improvement. Simple methods originally employed time-domain filtering of the corrupted signal; however, this is only successful when removing high frequency noise from low frequency signals and does not provide satisfactory results under real world conditions. To improve performance, modern algorithms filter signals in some transform domain, such as the Fourier domain. Over the past two decades, a flurry of activity has involved the use of the wavelet transform, after the community recognized that it could be a superior alternative to Fourier analysis. Numerous signal and image processing techniques have since been developed to leverage the power of wavelets.
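As a concrete illustration of wavelet-domain denoising, here is a minimal one-level Haar transform with soft thresholding of the detail coefficients, where high-frequency noise tends to concentrate. This is a toy sketch of the principle, not the method of any cited work; the threshold value would be chosen from a noise estimate in practice.

```python
import math

def haar_forward(x):
    """One level of the orthonormal Haar wavelet transform
    (x must have even length)."""
    s = math.sqrt(2.0)
    approx = [(x[2 * i] + x[2 * i + 1]) / s for i in range(len(x) // 2)]
    detail = [(x[2 * i] - x[2 * i + 1]) / s for i in range(len(x) // 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Perfect reconstruction from the two coefficient bands."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) / s)
        x.append((a - d) / s)
    return x

def denoise(x, threshold):
    """Soft-threshold the detail band, then reconstruct."""
    approx, detail = haar_forward(x)
    shrunk = [math.copysign(max(abs(d) - threshold, 0.0), d) for d in detail]
    return haar_inverse(approx, shrunk)
```

With a zero threshold the scheme reconstructs the input exactly; with a threshold above the noise amplitude, small alternating fluctuations are flattened out while the smooth component survives.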
• Voiced sounds result when the excitation consists of quasi-periodic pulses of air, produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate.
• Unvoiced sounds result when the excitation is a noise-like turbulence produced by forcing air at high velocities through a constriction in the vocal tract while the glottis is held open. Such sounds show little long-term periodicity, as can be seen from Figures 3 and 4, although short-term correlations due to the vocal tract are still present.
• Plosive sounds result when a complete closure is made in the vocal tract, and air pressure is built up behind this closure and released suddenly.
Some sounds cannot be considered to fall into any one of the three classes above, but are a mixture. For example, voiced fricatives result when both vocal cord vibration and a constriction in the vocal tract are present.
Although there are many possible speech sounds which can be produced, the shape of the vocal tract and its mode of excitation change relatively slowly, so speech can be considered quasi-stationary over short periods of time (of the order of 20 ms). Speech signals show a high degree of predictability, due sometimes to the quasi-periodic vibrations of the vocal cords and also to the resonances of the vocal tract. Speech coders attempt to exploit this predictability in order to reduce the data rate necessary for good quality voice transmission.
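Because speech is quasi-stationary only over roughly 20 ms, analysis typically proceeds on short overlapping frames. A minimal framing helper is sketched below; the 20 ms frame and 10 ms hop are conventional illustrative choices, not values mandated by the text.

```python
def frame_signal(samples, fs, frame_ms=20, hop_ms=10):
    """Split a sampled signal into short overlapping frames over
    which speech can be treated as quasi-stationary."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]
```

At an 8 kHz sampling rate this yields 160-sample frames advancing 80 samples at a time, the granularity at which predictor coefficients are usually re-estimated.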
From the technical, signal-oriented point of view, the production of speech is widely described as a two-stage process: in the first stage the sound is initiated, and in the second stage it is filtered. This distinction between stages has its origin in the source-filter model of speech production.
Fig 3: Source Filter Model of Speech Production
The basic assumption of the model is that the source signal produced at the glottal level is linearly filtered through the vocal tract. The resulting sound is emitted to the surrounding air through radiation loading (lips). The model assumes that source and filter are independent of each other. Although recent findings show some interaction between the vocal tract and the glottal source (Rothenberg 1981; Fant 1986), Fant's theory of speech production is still used as a framework for the description of the human voice, especially as far as the articulation of vowels is concerned.
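The source-filter idea can be sketched as an impulse-train source driving an all-pole (vocal tract) filter computed by its difference equation. The pitch period and the single filter coefficient below are arbitrary illustrative values, not measured speech parameters.

```python
def synthesize_voiced(pitch_period, coeffs, n):
    """Source-filter sketch: an impulse train (glottal source) drives
    an all-pole filter  y[i] = x[i] + sum_k coeffs[k-1] * y[i-k]."""
    y = []
    for i in range(n):
        x = 1.0 if i % pitch_period == 0 else 0.0  # glottal pulses
        acc = x
        for k, a in enumerate(coeffs, start=1):
            if i - k >= 0:
                acc += a * y[i - k]
        y.append(acc)
    return y
```

With no filter coefficients the output is just the raw source; adding pole coefficients shapes each pulse into a decaying resonance, which is the filtering stage of the model.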
What is Speech Processing?
The term speech processing refers to the scientific discipline concerned with the analysis and processing of speech signals in order to achieve the best benefit in various practical scenarios. The field of speech processing is, at present, undergoing rapid growth in terms of both performance and applications, stimulated by advances in microelectronics, computation and algorithm design. Nevertheless, speech processing still covers an extremely broad area, which relates to the following three engineering applications:
• Speech Coding and Transmission, which is mainly concerned with man-to-man voice communication;
• Speech Synthesis, which deals with machine-to-man communication;
• Speech Recognition, relating to man-to-machine communication.
Speech Coding:
Speech coding or compression is the field concerned with compact digital representations of speech signals for the purpose of efficient transmission or storage. The central objective is to represent a signal with a minimum number of bits while maintaining perceptual quality. Current applications for speech and audio coding algorithms include cellular and personal communications networks (PCNs), teleconferencing, desktop multi-media systems, and secure communications.
Speech Synthesis:
The process that converts a command sequence or input text (words or sentences) into a speech waveform, using algorithms and previously coded speech data, is known as speech synthesis. The text can be input through a keyboard, by optical character recognition, or from a previously stored database. A speech synthesizer can be characterized by the size of the speech units it concatenates to yield the output speech, as well as by the method used to code, store and synthesize the speech. If large speech units are involved, such as phrases and sentences, high-quality output speech can be achieved, at the cost of large memory requirements. Conversely, efficient coding methods can be used to reduce memory needs, but these usually degrade speech quality.
Factors associated with speech:
Formants:
It is known from research that the vocal tract and nasal tract are tubes of non-uniform cross-sectional area. As the generated sound propagates through these tubes, its frequency spectrum is shaped by the frequency selectivity of the tube. This effect is very similar to the resonance effects observed in organ pipes and wind instruments. In the context of speech production, the resonance frequencies of the vocal tract are called formant frequencies, or simply formants. In our engineered model, the poles of the transfer function correspond to the formants. The human auditory system is much more sensitive to poles than to zeros.
Phonemes:
Phonemes can be defined as the symbols from which every sound can be classified or produced. Every language has its particular phonemes, which typically number from 30 to 50; English has 42 phonemes. A crude estimate, considering physical limitations on articulator motion, is that speech conveys about 10 phonemes per second.
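The 42-phoneme inventory and the 10-phonemes-per-second articulation limit give a crude upper bound on the information rate of the phonemic content of speech:

```python
import math

n_phonemes = 42        # English phoneme inventory, figure used above
phoneme_rate = 10      # phonemes per second (articulatory limit)

bits_per_phoneme = math.log2(n_phonemes)      # ~5.4 bits
info_rate = phoneme_rate * bits_per_phoneme   # ~54 bits per second
```

This is orders of magnitude below the raw rate of, say, 8-bit PCM at 8 kHz (64 kbps), which is precisely the redundancy that speech coders exploit.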
Types of Phonemes:
Speech sounds can be classified into 3 distinct classes according to the mode of excitation.
1. Plosive Sounds
2. Voiced Sounds
3. Unvoiced Sounds
1. Plosive Sounds:
Plosive Sounds result from making a complete closure (again toward the front end of the vocal tract), building up pressure behind the closure, and abruptly releasing it.
2. Voiced Sounds:
Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract.
Voiced sounds are characterized by:
• High energy levels;
• Very distinct resonant (formant) frequencies.
The rate at which the vocal cords vibrate determines the pitch. These vibrations are periodic in time, so voiced sounds can be approximated by an impulse train. The spacing between impulses is the pitch period, whose reciprocal is the fundamental frequency F0.
3. Unvoiced Sounds:
Unvoiced sounds are generated by forming a constriction at some point in the vocal tract (usually toward the mouth end) and forcing air through the constriction at high enough velocity to produce turbulence. This creates a broad-spectrum noise source to excite the vocal tract.
Unvoiced sounds are characterized by:
• Lower energy levels than voiced sounds;
• Higher frequencies than voiced sounds.
In other words, unvoiced sounds (e.g. /ʃ/, /s/, /p/) are generated without vocal cord vibration. The excitation is modeled by a white Gaussian noise source. Unvoiced sounds have no pitch, since they are excited by a non-periodic signal.
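The classical two-state excitation model described above (an impulse train at the pitch period for voiced sounds, white Gaussian noise for unvoiced) can be sketched as follows; the default pitch period and seed are illustrative values.

```python
import random

def excitation(voiced, n, pitch_period=80, seed=0):
    """Two-state excitation: impulse train for voiced frames,
    white Gaussian noise for unvoiced frames."""
    if voiced:
        return [1.0 if i % pitch_period == 0 else 0.0 for i in range(n)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]
```

This source signal is what drives the vocal-tract filter in source-filter synthesis; a pitch period of 80 samples at 8 kHz corresponds to an F0 of 100 Hz.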
Spectra of Typical Voiced and Unvoiced Speech
By passing the speech through a prediction filter A(z), the spectrum is considerably flattened (whitened), but it still contains some fine detail.
Special Type of Voiced and Unvoiced Sounds:
There are, however, some special types of voiced and unvoiced sounds, which are briefly discussed here to give the reader an idea of the further varieties of voiced and unvoiced speech.
Vowels:
Vowels are produced by exciting a fixed vocal tract with quasi-periodic pulses of air caused by vibration of the vocal cords. The way in which the cross-sectional area varies along the vocal tract determines the resonant frequencies of the tract (formants) and thus the sound that is produced. The dependence of cross-sectional area upon distance along the tract is called the area function of the vocal tract. The area function for a particular vowel is determined primarily by the position of the tongue, but the positions of the jaw and lips also affect the resulting sound to a small extent.
Examples
a, e, i, o, u
Diphthongs:
Although there is some ambiguity and disagreement as to what is and what is not a diphthong, a reasonable definition is that a diphthong is a gliding monosyllabic speech item that starts at or near the articulatory position for one vowel and moves to or toward the position for another. According to this definition, there are six diphthongs in American English.
Diphthongs are produced by varying the vocal tract smoothly between vowel configurations appropriate to the diphthong. Thus, diphthongs can be characterized by a time-varying vocal tract area function which varies between two vowel configurations.
Examples:
/eI/ (as in bay), /oU/ (as in boat), /aI/ (as in buy), /aU/ (as in how)
Semivowels:
The group of sounds consisting of /w/, /l/, /r/ and /y/ is quite difficult to characterize. These sounds are called semivowels because of their vowel-like nature. They are generally characterized by a gliding transition in the vocal tract area function between adjacent phonemes; thus the acoustic characteristics of these sounds are strongly influenced by the context in which they occur. For our purpose they are simply considered transitional, vowel-like sounds, similar in nature to vowels and diphthongs.
Nasals:
The nasal consonants /m/, /n/, and /ŋ/ are produced with glottal excitation and the vocal tract totally constricted at some point along the oral passageway. The velum is lowered so that air flows through the nasal tract, with sound being radiated at the nostrils. Furthermore, the nasal consonants and nasalized vowels (i.e., some vowels preceding or following nasal consonants) are characterized by resonances which are spectrally broader, or more highly damped, than those for vowels.
Unvoiced Fricatives:
The unvoiced fricatives /f/, /θ/, /s/ and /ʃ/ are produced by exciting the vocal tract with a steady air flow which becomes turbulent in the region of a constriction in the vocal tract. The location of the constriction determines which fricative sound is produced: for /f/ the constriction is near the lips; for /θ/ it is near the teeth; for /s/ it is near the middle of the oral tract; and for /ʃ/ it is near the back of the oral tract.
Voiced Fricatives:
The voiced fricatives /v/, /ð/, /z/ and /ʒ/ are the counterparts of the unvoiced fricatives /f/, /θ/, /s/ and /ʃ/, respectively, in that the place of constriction for each of the corresponding phonemes is essentially identical.
However, the voiced fricatives differ from their unvoiced counterparts in that two excitation sources are involved in their production. The spectra of voiced fricatives can therefore be expected to display two distinct components.
Voiced Stops:
The voiced stops /b/, /d/ and /g/ are transient, non-continuant sounds which are produced by building up pressure behind a total constriction somewhere in the oral tract and suddenly releasing the pressure. For /b/ the constriction is at the lips; for /d/ it is at the back of the teeth; and for /g/ it is near the velum. During the period in which there is a total constriction in the tract, no sound is radiated from the lips.
Since the stop sounds are dynamic in nature, their properties are highly influenced by the vowel which follows the stop consonant.
Unvoiced Stops:
The unvoiced stop consonants /p/, /t/ and /k/ are similar to their voiced counterparts /b/, /d/ and /g/, with one major exception: during the period of total closure of the tract, as the pressure builds up, the vocal cords do not vibrate. Thus, following the period of closure, as the air pressure is released, there is a brief interval of friction (due to the sudden turbulence of the escaping air) followed by a period of aspiration (steady flow of air from the glottis exciting the resonances of the vocal tract) before voiced excitation begins.
Speech Recognition:
Speech or voice recognition is the ability of a machine or program to recognize and carry out voice commands or take dictation. On the whole, speech recognition involves the ability to match a voice pattern against a provided or acquired vocabulary. A limited vocabulary is usually provided with a product, and the user can record additional words. More sophisticated software has the ability to accept natural speech (meaning speech as we usually speak it, rather than carefully spoken speech). Speech information can be observed and processed only in the form of sound waveforms, so it is essential that the speech signal be reconstructed properly; moreover, for the signal to be processed in discrete form, sampling plays a critical role. In the next section we will take a look at how sampling is done.
Why Encode Speech?
Speech coding has been, and still is, a major issue in the area of digital speech processing. Speech coding is the act of transforming the speech signal at hand into a more compact form, which can then be transmitted or stored using considerably less memory or bandwidth. The motivation is that access to an unlimited amount of bandwidth is not possible; therefore, there is a need to code and compress speech signals. Speech compression is required in long-distance communication, high-quality speech storage, and message encryption. For example, in digital cellular technology many users need to share the same frequency bandwidth, and utilizing speech compression makes it possible for more users to share the available system. Another example where speech compression is needed is digital voice storage: for a fixed amount of available memory, compression makes it possible to store longer messages.
Speech coding is a lossy type of coding, which means that the output signal does not sound exactly like the input; the input and output signals can be distinguished from one another. Coding of audio, however, is a different kind of problem from speech coding: audio coding tries to code the audio in a perceptually lossless way. This means that even though the input and output signals are not mathematically equivalent, the sound at the output is perceived to be the same as the input. This type of coding is used in applications for audio storage, broadcasting, and Internet streaming.
Several techniques of speech coding exist, such as Linear Predictive Coding (LPC), waveform coding and subband coding. The problem at hand is to use LPC to code given speech sentences. The speech signals to be coded are narrowband signals with frequencies ranging from 0 to 4 kHz, and the sampling frequency should be 8 kHz. Different types of applications have different time-delay constraints: for example, in network telephony only a delay of 1 ms is acceptable, whereas a delay of 500 ms is permissible in video telephony. Another constraint is not to exceed an overall bit rate of 8 kbps.
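The analysis core of an LPC coder estimates, for each frame, the predictor coefficients that whiten the signal. A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion is given below; it illustrates the standard technique, not the exact coder developed in this work, and the frame data and order in the usage example are illustrative.

```python
def lpc(frame, order):
    """LPC coefficients a_1..a_order (predictor x_hat[n] = sum a_k x[n-k])
    via the autocorrelation method and Levinson-Durbin recursion."""
    n = len(frame)
    # Autocorrelation lags r[0..order]
    r = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [0.0] * order
    err = r[0]  # prediction error energy
    for m in range(order):
        # Reflection coefficient for this recursion step
        acc = r[m + 1] - sum(a[k] * r[m - k] for k in range(m))
        k_m = acc / err
        new_a = a[:]
        new_a[m] = k_m
        for k in range(m):
            new_a[k] = a[k] - k_m * a[m - 1 - k]
        a = new_a
        err *= (1.0 - k_m * k_m)
    return a, err
```

For a decaying exponential x[n] = 0.5^n, a first-order analysis recovers the generating coefficient 0.5 almost exactly, since the signal is perfectly predictable from one past sample.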
The speech coder developed here will be analyzed using both subjective and objective measures. Subjective analysis will consist of listening to the encoded speech signal and judging its quality; the quality of the played-back speech is based solely on the opinion of the listener, who may rate the speech as impossible to understand, intelligible, or natural sounding. Even though this is a valid measure of quality, an objective analysis will also be introduced to assess the speech quality technically and to minimize human bias. Furthermore, the effects of bit rate, complexity and end-to-end delay on the output speech quality will be analyzed.
Chapter 2
Previous method
Wavelet Domain Method:
FIBONACCI NUMBERS AND GOLDEN RATIO
The numbers 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …, known as the Fibonacci numbers, were named by the nineteenth-century French mathematician Edouard Lucas after Leonardo Fibonacci of Pisa, one of the best mathematicians of the Middle Ages, who referred to them in his book Liber Abaci (1202) in connection with his rabbit problem. The Fibonacci sequence has fascinated both amateur and professional mathematicians for centuries, due to its abundant applications and its ubiquitous habit of occurring in totally surprising and unrelated places [17]. In this work we apply Fibonacci numbers for the first time to audio watermarking. The equation producing the sequence of Fibonacci numbers is given below:
F(n) = F(n-1) + F(n-2), with F(1) = F(2) = 1.
Fibonacci numbers have very interesting properties. One of the most famous is that the ratio of successive Fibonacci numbers converges to the golden ratio, (1 + √5)/2 ≈ 1.618.
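As a quick check of the sequence and its golden-ratio behaviour, a short sketch:

```python
def fibonacci(n):
    """First n Fibonacci numbers: F(1) = F(2) = 1, F(k) = F(k-1) + F(k-2)."""
    seq = []
    a, b = 1, 1
    for _ in range(n):
        seq.append(a)
        a, b = b, a + b
    return seq

fibs = fibonacci(12)
# The ratio of successive terms approaches (1 + sqrt(5)) / 2.
ratio = fibs[-1] / fibs[-2]
```

Already at the twelfth term the ratio 144/89 agrees with the golden ratio to about four decimal places.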