
Image to Voice Converter

An image to voice converter is software or a device that recognizes an image and converts it into a human voice. The purpose of the conversion is to provide a communication aid that helps blind people sense what is in their hands or in front of them. The converter is also suitable for the early education of children aged three to six.

This project involves image processing and sound generation. Image processing is a set of computational techniques for analysing, reconstructing, compressing, and enhancing images. An image of the input object is captured through a scanner or webcam, then analysed and manipulated using specialised applications such as MATLAB, and the result is sent to an output device such as a printer or a monitor.

Image processing includes several techniques, such as template matching, KNN (K-Nearest Neighbour), and thresholding. Template matching finds small parts of an image that match a template image; it is also used to identify printed characters, numbers, and other small, simple objects. KNN is an algorithm that works well in practice and is easy to understand. It is a lazy algorithm, meaning it does not use the training data points to do any generalization in advance. Thresholding is one of the main approaches to image segmentation. It is a non-linear operation that converts a grayscale image into a binary image.

The purpose of image processing in this project is to analyse a picture using techniques that can identify shades, colours, and relationships that cannot be perceived by the human eye. Image processing is also used to solve identification problems, e.g. in forensic medicine or in creating weather maps from satellite images. It deals with images in bitmapped graphics format that have been scanned in or captured with digital cameras. Sound generation produces a sound through the Windows sound library or plays a WAV file from the computer.

Problem Statement

Nowadays, many visually impaired people still use a white cane to sense the path and the objects in front of them. With only a plain stick and no sight, it is difficult for a person to get a sense of direction, and they often cannot tell what the objects around them are. As the economy worsens, most people and families are occupied with their working lives and have little time to care for people with disabilities. People with disabilities, especially blind people, therefore have to adapt to this in their daily lives. In addition, the product can help small children improve their ability to distinguish everyday items. These are the reasons the product described above was developed.

Project Aim and Objectives:

The aim of this project is to develop an image to voice converter that can recognize an image from the webcam and then convert it into sound, through the Windows sound library or a WAV file, with good performance. To achieve the main aim of this project, the following sub-objectives have to be met:

To develop a unique image recognition algorithm for shapes and colours for real-time application using MATLAB.

To analyse the performance of the image recognition algorithm in terms of accuracy and processing time.

To develop an algorithm to convert the recognized image to voice using MATLAB.

To analyse the performance of the image to voice conversion algorithm.

To test the performance of the closed-loop interface of the image and sound processing converter system.

To develop a Graphical User Interface (GUI) for the image to voice converter for the convenience of the user.

Project Scope/Limitation

The scope of this project is to build a unique image to voice converter within the project period at a cost not exceeding RM200. The project consists of hardware, which is a webcam, and software, which is MATLAB. The system captures an image using the webcam, then recognizes the image and generates a sound using MATLAB with several techniques. The product is specially designed for visually impaired people and to improve small children's learning ability. The project has a few limitations, as follows:

Shape limitation

Colour limitation

Resolution limitation

Distance limitation

Literature Review

Image processing is a method of converting an image into digital form and performing operations on it in order to obtain an enhanced image or to extract useful information from it. It is a type of signal processing in which the input is an image, such as a video frame or photograph, and the output may be an image or characteristics associated with that image. Usually, an image processing system treats images as two-dimensional signals and applies established signal processing methods to them[1]. The image recognition process can be divided into several stages: image acquisition, image pre-processing, image segmentation, image representation, and image classification. In image acquisition, a digital image is produced by one or several image sensors, such as various types of light-sensitive cameras, range sensors, tomography devices, radar, and ultrasonic cameras. Depending on the type of sensor, the resulting image data is an ordinary two-dimensional image, a three-dimensional volume, or an image sequence. The pixel values usually correspond to light intensity in one or several spectral bands, but can also relate to various physical measures, such as depth, absorption or reflectance of sonic or electromagnetic waves, or nuclear magnetic resonance.

Image pre-processing is one of the stages that can improve the reliability of an optical inspection. It can be divided into two categories: image enhancement and image restoration. Image enhancement involves intensifying certain features of images for display or analysis purposes. Enhancement techniques include edge enhancement, noise filtering, magnifying, and sharpening an image. Several filter operations that intensify or reduce certain image details enable an easier or faster evaluation, for example the mean filter, median filter, and Wiener filter. With continuous use, an image becomes degraded and accumulates many errors. Image restoration is the process used to restore the degraded image. This technique is also used to correct images from different sensors that appear blurry or out of focus[2].

Next, image segmentation groups pixels into salient image regions, for example regions corresponding to individual surfaces, objects, or natural parts of objects. Segmentation can be used for object recognition, occlusion boundary estimation within motion or stereo systems, image compression, image editing, or image database look-up. Traditional image segmentation methods include grey threshold segmentation, edge detection, region growing, and split-and-merge. The threshold technique is applied in this project. It is a method that operates on grayscale images. Ignoring for the moment the influence of noise and illumination, it can be assumed that most pixels belonging to the objects have a relatively low gray level, whereas the background pixels have a relatively high gray level. For example, black is represented by a gray level of 0 and white by a gray level of 255. Based on this observation, the pixels in the image can be separated into two dominant groups according to their gray level. These gray levels may serve as "detectors" to distinguish between background and objects in the image. On the other hand, if the image contains smooth-edged objects, it will not be a pure black and white image, so it will not be possible to find two distinct gray levels characterizing the background and the objects. This problem intensifies in the presence of noise[3]. To overcome the ill effects of noise and shading, two methods can be used: Otsu's method, known as a "global threshold", and the neighbourhood method, known as an "adaptive threshold".
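As a minimal illustration of the gray-level separation described above, the following Python sketch binarizes a tiny grayscale image with a fixed threshold. The project itself uses MATLAB; the function name `threshold`, the sample image, and the threshold value 128 are assumptions made only for this example.

```python
def threshold(gray, t):
    """Return a binary image: 1 where the pixel is darker than t (object), else 0."""
    return [[1 if p < t else 0 for p in row] for row in gray]


# Hypothetical 4x4 image: a dark object (low gray level) on a bright background.
image = [
    [250, 248, 251, 249],
    [247,  20,  25, 250],
    [252,  18,  22, 248],
    [249, 251, 250, 247],
]

binary = threshold(image, 128)
for row in binary:
    print(row)
```

A fixed threshold like this only works when the two gray-level groups are clearly separated, which is why the text goes on to consider Otsu's method and adaptive thresholding.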

In image representation, all information is ultimately represented in binary. This is true of images as well as numbers and text. However, an important distinction must be made between how image data is displayed and how it is stored. Displaying involves a bitmap representation, while storing as a file involves one of many image formats, such as JPEG and PNG[4]. Techniques for image representation include the roundness ratio, also known as circularity, and Fourier descriptors.

The intent of image classification is to categorize all pixels in a digital image into one of several land cover classes, or "themes". This categorized data may then be used to produce thematic maps of the land cover present in an image. Normally, multispectral data are used to perform the classification, and the spectral pattern present within the data for each pixel is used as the numerical basis for categorization. The objective of image classification is to identify and portray, as a unique gray level or colour, the features occurring in an image in terms of the object or type of land cover these features actually represent on the ground[5]. The techniques used for this stage are template matching and KNN (K-Nearest Neighbour).

Table 1: Comparison of image sensors for image acquisition[6, 7]

Types of Image Sensor

Webcam

Advantages:

- allows face-to-face interaction
- low cost
- easy to use

Disadvantages:

- low resolution
- not portable
- no optical zoom lens
- no auto-focus

Digital Camera

Advantages:

- high resolution
- portable with batteries
- has optical zoom lens
- has auto-focus
- high operating speed

Disadvantages:

- less durable
- faster battery consumption
- high cost
- many complicated functions

From Table 1, it can be seen that both image sensors have their strengths and weaknesses. This research focuses on the webcam because it is the image sensor used in this project. The webcam connects to a computer to capture an image for image recognition. It is also easy to use and cheaper than a digital camera, which is more complex and costly. However, the resolution (megapixel count) of a digital camera is higher than that of a webcam.


Table 2: Comparison of various types of filter for image pre-processing[2, 8]

Types of filter

Median filter

Advantages:

- more robust
- more smoothing
- provides good results

Disadvantages:

- memory consuming
- complex computation

Mean filter

Advantages:

- intuitive
- easy to use
- smoothing

Disadvantages:

- not good at sharpening images
- susceptible to negative outliers

Wiener filter

Advantages:

- short computation time
- controls output error
- straightforward to design

Disadvantages:

- results often too blurred
- spatially invariant

From Table 2, it can be seen that all filters have their strengths and weaknesses. This research focuses on two types of filter, the median filter and the mean filter. The median filter has been chosen for this project because the median is a more robust average than the mean, so a single unrepresentative pixel in a neighbourhood does not affect the median value significantly. Since the median value must actually be the value of one of the pixels in the neighbourhood, the median filter does not create new unrealistic pixel values when the filter straddles an edge. For this reason the median filter is much better at preserving sharp edges than the mean filter. The median filter also removes more noise than the mean filter.
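The difference between the two filters can be illustrated with a short Python sketch on a one-dimensional signal. This is not the project's MATLAB code; the helper `filter_1d`, the window size of 3, the edge replication, and the sample signal (a noise spike followed by a step edge) are assumptions chosen for the example.

```python
from statistics import mean, median


def filter_1d(signal, size, reducer):
    """Slide a window of odd length `size` over `signal`, replicating the edges."""
    half = size // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [reducer(padded[i:i + size]) for i in range(len(signal))]


# Hypothetical signal: one impulsive outlier (255) and a step edge from 10 to 200.
step_with_spike = [10, 10, 255, 10, 10, 200, 200, 200, 200]

mean_out = filter_1d(step_with_spike, 3, mean)
median_out = filter_1d(step_with_spike, 3, median)

print("mean  :", [round(v, 1) for v in mean_out])
print("median:", median_out)
```

In this example the median output contains only values that already exist in the input, so the spike disappears and the step edge stays sharp, whereas the mean output smears both the spike and the edge into intermediate values.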

Table 3: Comparison of threshold techniques for image segmentation[9, 10]

Threshold Techniques

Otsu (global threshold)

Advantages:

- fast
- ease of coding
- easy to use
- less sensitive

Disadvantages:

- assumes uniform illumination
- does not use any object structure or spatial coherence
- complex computation

Neighbourhood (adaptive threshold)

Advantages:

- produces a good result
- less computation

Disadvantages:

- memory consumption
- time consumption
- sensitive

From Table 3, it can be seen that both techniques have their own strengths and weaknesses. Otsu's method, named after its inventor Nobuyuki Otsu, is a global threshold and one of many binarization algorithms[11]. It involves iterating through all possible threshold values and calculating a measure of spread for the pixel levels on each side of the threshold, i.e. the pixels that fall in either the background or the foreground. The aim is to find the threshold value at which the sum of foreground and background spreads is at its minimum. The neighbourhood method, also known as adaptive threshold, separates the desired foreground objects from the background based on the difference in pixel intensities of each region. The difference between the two methods is that Otsu's method uses a histogram of the whole image to threshold it, whereas the neighbourhood method uses a histogram of a small region (neighbourhood) around each pixel to threshold that pixel. Furthermore, Otsu's method suffers fewer of the errors caused by the sensitivity of local algorithms to image noise than the neighbourhood method does.
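The neighbourhood idea can be sketched in Python as follows. This is a simplified variant that compares each pixel with the mean of its local window rather than with a full local histogram; the function name `adaptive_threshold`, the `radius` and `offset` parameters, and the sample image with uneven illumination are all assumptions made for illustration.

```python
def adaptive_threshold(gray, radius=1, offset=15):
    """Mark a pixel as object (1) when it is darker than its local mean minus `offset`."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the neighbourhood, clipped at the image border.
            window = [gray[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            local_mean = sum(window) / len(window)
            out[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return out


# Hypothetical image: the background brightens from left to right, and two
# objects are each 40 gray levels darker than their local background.
shaded = [
    [100, 110, 120, 130, 140, 150],
    [100,  70, 120, 130, 100, 150],
    [100, 110, 120, 130, 140, 150],
]

local_result = adaptive_threshold(shaded)
global_result = [[1 if p < 105 else 0 for p in row] for row in shaded]

for row in local_result:
    print(row)
```

In this example the right-hand object has the same gray level (100) as the left-hand background, so no single global threshold can separate both objects from the background, while the local rule finds both.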

Table 4: Comparison of the two techniques for image representation[12]

Techniques of Image Representation

Roundness Ratio

Advantages:

- very fast algorithm
- scale, position and rotation invariant
- high accuracy if the image shape is preserved properly after segmentation

Disadvantages:

- prone to errors if the object shape is altered due to incorrect segmentation

Fourier Descriptor

Advantages:

- medium speed
- produces a good result
- low computation cost
- overcomes poor discrimination ability
- scale, position and rotation invariant

Disadvantages:

- difficult to obtain high-order invariant moments
- cannot deal with disjoint shapes

From Table 4, it can be seen that both techniques have their strengths and weaknesses. Roundness is defined as a condition of a surface of revolution, like a cylinder, cone, or sphere, where all points of the surface intersected by any plane perpendicular to a common axis (in the case of a cylinder and cone) are equidistant from the axis. Since the axis and centre do not exist physically, measurements have to be made with reference to the surfaces of the figures of revolution only. To measure roundness, only the circularity of the contour is determined[12]. Fourier descriptors are used to describe the contour of a shape. They were introduced in the early 1960s by Cosgriff and Fritzsche. According to Fourier analysis theory, Fourier coefficients can be generated by a Fourier transformation. The lower frequency coefficients carry the general form of the shape, and the higher frequency coefficients carry the finer details of the shape. The harmonic amplitude and the phase angle can represent the Fourier descriptor, and the Fourier coefficients are usually normalized by dividing each by the first Fourier coefficient. Because fast algorithms exist for computing the coefficients of a Fourier series, many recognition systems in machine vision use these coefficients as shape features.
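A hedged Python sketch of Fourier descriptors is given below. It treats each contour point as a complex number, computes a plain discrete Fourier transform, drops the zero-frequency term (which only encodes position), and divides the remaining magnitudes by the magnitude of the first coefficient. The function name, the assumption that the contour is sampled counter-clockwise, and the sample square are illustrative choices, not details taken from the cited work.

```python
import cmath


def fourier_descriptors(contour, count):
    """Return `count` normalized coefficient magnitudes for an ordered (x, y) contour.

    Assumes the contour is traversed counter-clockwise so that the first
    harmonic is non-zero.
    """
    n = len(contour)
    z = [complex(x, y) for x, y in contour]
    coeffs = [
        sum(z[k] * cmath.exp(-2j * cmath.pi * u * k / n) for k in range(n)) / n
        for u in range(n)
    ]
    scale = abs(coeffs[1])  # dividing by the first coefficient removes scale
    # Skipping coeffs[0] removes position; taking magnitudes removes rotation.
    return [abs(c) / scale for c in coeffs[2:2 + count]]


# Hypothetical contour: a square sampled at 8 boundary points.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
# The same square, three times larger and shifted.
moved = [(3 * x + 5, 3 * y - 2) for x, y in square]

print(fourier_descriptors(square, 4))
print(fourier_descriptors(moved, 4))
```

The two printed lists agree up to rounding error, which illustrates the scale and position invariance listed in the table.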

Table 5: Comparison of several techniques for image classification[13-15]

Techniques for Image Classification

Template Matching

Advantages:

- easy to implement
- high flexibility
- high detection accuracy

Disadvantages:

- shape limitation
- computation speed
- vulnerable to scaling and rotation

K-Nearest Neighbour

Advantages:

- easy to implement
- very effective
- improves accuracy
- improves run-time performance

Disadvantages:

- poor run-time performance if the training set is large
- very sensitive
- outperformed by more exotic techniques

Neural Network

Advantages:

- minimizes energy function
- high accuracy
- easy to use

Disadvantages:

- unstable
- curse of dimensionality
- memory consumption

From Table 5, it can be seen that all techniques have their strengths and weaknesses. This research focuses on two techniques, template matching and K-Nearest Neighbour. The classical template matching method is characterized by a simple mechanism and high detection accuracy, and is used as a general model for analysis and error estimation. Hence, it plays an important role in image processing and is widely used in object detection and recognition. However, the conflict between speed and accuracy is pronounced. The main factors affecting speed are the search computation and the template matching operations. Appropriately reducing the number of search positions and the precision of the similarity computation naturally increases the speed of template matching, and this is becoming a focus in the field. Many studies concentrate on improving the search algorithm, reducing the number of matching operations by decreasing the number of matching points on the template, so that speed is achieved. Typical algorithms are the pyramid algorithm, the genetic algorithm, and so on. Every matching process is based on template matching, so it is important to improve the computation speed of template matching itself[14]. The intuition underlying nearest neighbour classification is quite simple: examples are classified based on the class of their nearest neighbours. It is often useful to take more than one neighbour into account, so the technique is more commonly referred to as K-Nearest Neighbour (KNN) classification, where the k nearest neighbours are used in determining the class. Since the training examples are needed at run time, i.e. they have to be in memory at run time, it is sometimes also called memory-based classification. Because induction is delayed to run time, it is considered a lazy learning technique[13].


Analysis of Similar Products and Paper Literature

Oral Image to Voice Converter by Takaaki HASEGAWA and Keiichi OHTANI[16]:

In this paper, the authors propose a new voice communication system that converts an oral image into voice, the "Image Input Microphone". The system synthesizes the voice from the oral image alone. It provides high security and is not affected by acoustic noise, because real utterance is not always necessary for input. Moreover, since the voice is synthesized without recognition, the system is independent of language.

Simulations converting oral images to voice for the five Japanese vowels were carried out as a basic investigation. A vocal tract area function is estimated from the oral image, and a PARCOR synthesis filter is obtained from the vocal tract area function. The PARCOR synthesis filter is driven by a pulse train. The performance of the system is evaluated by listening tests of the synthesized voice. As a result, audible voice was synthesized and the mean recognition rate for the five Japanese vowels was 91%.

The paper describes a system that converts an oral image into voice, taking into account the human ability to lip-read. In the proposed system, the voice is synthesized directly from the oral image alone without recognition, and real utterance is not always necessary for input. The authors use both features of the tongue and features of the lips obtained from the oral image. The system is therefore not affected by acoustic noise, and at the same time it offers high security because no utterance needs to be input.

The system is built around a vocal tract area function, which is equivalent to the transfer function of the vocal tract, as a parameter. "Indirect" means synthesis via the vocal tract area function. The vocal tract area function is obtained from PARCOR analysis of speech signals, and speech signals are synthesized by the inverse process of PARCOR analysis. Therefore, if the vocal tract area function is estimated from oral image signals, the oral image can be converted to the corresponding voice. Humans utter various sounds by changing the vocal tract, and the articulators move not independently but cooperatively in utterance. It is generally known that information about articulation can be obtained from lip-reading.

Software Comparison

The table below compares the two programs, MATLAB and C++.

Table 6: Comparison of software between MATLAB and C++[17]

Types of Software

MATLAB

Advantages:

- easy to learn
- fast numerical algorithms
- inexpensive software
- fast development

Disadvantages:

- slow processing
- complex computation

C++

Advantages:

- mature standard
- large community
- fast

Disadvantages:

- complex computation
- difficult to debug
- low-level programming

From Table 6, it can be seen that both types of software have their strengths and weaknesses. MATLAB is software that is widely used in the image processing and computer vision community. Multiple image analysis functions are built into the software, making it a very useful image analysis tool for the end user. C++ offers the Standard Template Library (STL) and libraries for computer graphics and image processing. Based on the C++ template mechanism, such a library accepts all C++ built-in types as image data, although certain functions are only valid for a subset of the built-in types. MATLAB has been selected because of the analytical nature of the project. MATLAB version R2010b will be used to analyse the image quality and performance in this project.

Project Methodology

This project is divided into hardware and software. The hardware consists of a webcam as the input and a loudspeaker as the output. The software uses MATLAB to convert an image to sound with several image processing techniques.

Block Diagram


Image Acquisition (acquire image)

Image Preprocessing (median filtering)

Image Segmentation

Image Representation (roundness ratio)

Image Classification (template matching using KNN)

Sound Generation (WAV file)

Figure 1: Block diagram of Image to Voice converter.

The block diagram shown in Figure 1 is the basic concept of the system to be implemented. Based on the block diagram, a webcam is prepared first, and an image is acquired in front of the webcam. Then median filtering is performed in image pre-processing using MATLAB, which filters out unwanted signals or noise in the image. Next is image segmentation: referring to the literature review, the most suitable method is Otsu's method, a thresholding technique that converts the grayscale image into a binary image. After that, the largest object is found and image representation is performed using the roundness ratio, to determine which template ratio the ratio of the largest object is closest to. The next stage is image classification, which uses template matching with the KNN technique to find the small part of the image that matches the template image.

Once matching is done, the computer automatically produces a sound from a WAV file.
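The sound generation stage could be sketched in Python as follows. The project loads and plays WAV files from MATLAB; this sketch only shows, with the standard `wave` module, how a WAV file for a recognized label might be produced and looked up. The names `write_tone`, `sound_path_for`, the label-to-file naming scheme, and the tone parameters are hypothetical, and playback is left out because it is platform specific.

```python
import math
import os
import struct
import wave


def write_tone(path, frequency=440.0, seconds=0.25, rate=8000):
    """Write a mono 16-bit sine tone to `path` and return the number of samples."""
    n = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * frequency * i / rate)))
        for i in range(n)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(frames)
    return n


def sound_path_for(label, sound_dir):
    """Map a recognized label such as "circle" to its WAV file (hypothetical naming)."""
    return os.path.join(sound_dir, label + ".wav")
```

In a real system the WAV files would contain recorded speech such as the spoken name of the object; the sine tone here is only a stand-in so that the example is self-contained.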

Flow Chart


Acquire image from webcam

Perform median filtering

Colour Space Conversion

Thresholding using Otsu

Image labelling

Find the largest object

Image Representation

-roundness ratio

Template matching using KNN

Is the image matched?



Generate Sound

Figure 2: Flow chart of Image to Voice converter.

Based on Figure 2, before image recognition begins, an image is first acquired in front of the webcam. The acquired image then goes through image enhancement, in which median filtering removes unwanted noise while preserving sharp edges. After that, a colour space conversion converts the image from one colour space to another, e.g. RGB, HSV, or YCbCr. The goal of converting the colour space is to make the converted image as close as possible to the original image. Next, thresholding is performed using Otsu's method, which calculates a measure of spread for the pixel levels on each side of the threshold. The reason for this is to separate the objects from the background. Once thresholding is done, image labelling is performed by tracing the exterior boundaries in the image and labelling them as occluding the background. After that, the largest object is found and image representation is performed using the roundness ratio to assess which template ratio the object is closest to. Then template matching is performed to find a match between the template and a portion of the image. The template that most closely matches the object is found using the KNN method against the database images. If the data is matched, the system automatically generates a sound by using MATLAB to load the WAV file from the computer or laptop, and then repeats the process from the first step. If the data is not matched, no sound is generated and the system returns to the first step and repeats the process until the data is matched.
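The labelling and largest-object steps can be sketched in Python as below. This is an illustrative flood-fill labelling with 4-connectivity, not the MATLAB routine used in the project; the function names and the sample binary image are assumptions.

```python
from collections import deque


def label_components(binary):
    """Label 4-connected regions of 1-pixels; return (label image, {label: size})."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    sizes = {}
    next_label = 1
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not labels[y][x]:
                # Breadth-first flood fill from an unlabelled object pixel.
                labels[y][x] = next_label
                queue = deque([(y, x)])
                size = 0
                while queue:
                    cy, cx = queue.popleft()
                    size += 1
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
                sizes[next_label] = size
                next_label += 1
    return labels, sizes


def largest_object(binary):
    """Return a mask containing only the largest labelled region, or None if empty."""
    labels, sizes = label_components(binary)
    if not sizes:
        return None
    best = max(sizes, key=sizes.get)
    return [[1 if v == best else 0 for v in row] for row in labels]


# Hypothetical thresholded image: two specks of noise and one 2x2 object.
binary_image = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
mask = largest_object(binary_image)
```

Keeping only the largest region discards small specks that survive thresholding, so that the roundness ratio is measured on the object of interest.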

Project's Method

Median Filter

Median filters are nonlinear rank-order filters based on replacing each element of the source vector with the median value taken over a fixed neighbourhood of the processed element. These filters are widely used in image and signal processing applications. The purpose of median filtering is to remove impulsive noise while keeping blurring of the signal to a minimum[18].
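A two-dimensional version of this filter can be sketched in Python as follows. It is a minimal illustration, not the MATLAB function used in the project; the function name `median_filter`, the 3x3 default window, and the edge-replication border handling are assumptions.

```python
from statistics import median


def median_filter(gray, radius=1):
    """Replace each pixel with the median of its (2*radius+1)^2 neighbourhood.

    Border pixels are handled by replicating the nearest edge pixel.
    """
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                gray[min(max(j, 0), h - 1)][min(max(i, 0), w - 1)]
                for j in range(y - radius, y + radius + 1)
                for i in range(x - radius, x + radius + 1)
            ]
            row.append(median(window))
        out.append(row)
    return out


# Hypothetical flat image corrupted by two impulses (one bright, one dark).
noisy = [
    [50, 50, 50, 50],
    [50, 255, 50, 50],
    [50, 50, 0, 50],
    [50, 50, 50, 50],
]
cleaned = median_filter(noisy)
```

Because each window holds an odd number of pixels, the median is always one of the original pixel values, and isolated impulses are removed without introducing new gray levels.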

Otsu's Method

Otsu's method is a widely used segmentation method, also known as the maximum inter-class variance method or the minimum intra-class variance method. It involves iterating through all possible threshold values and calculating a measure of spread for the pixel levels on each side of the threshold, i.e. the pixels that fall in either the foreground or the background. The aim is to find the threshold value at which the sum of foreground and background spreads is at its minimum[11].
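The exhaustive search described above can be sketched in Python as follows. The sketch takes the weighted within-class variance as the measure of spread and tries every threshold; the function name `otsu_threshold`, the convention that pixels below the threshold form one class, and the sample pixel values are assumptions for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t minimizing the weighted within-class variance.

    Pixels with value < t form one class and pixels with value >= t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)

    best_t, best_within = 0, float("inf")
    for t in range(1, levels):
        below, above = hist[:t], hist[t:]
        n0, n1 = sum(below), sum(above)
        if n0 == 0 or n1 == 0:
            continue  # one class is empty, so there is nothing to separate
        m0 = sum(i * c for i, c in enumerate(below)) / n0
        m1 = sum((i + t) * c for i, c in enumerate(above)) / n1
        v0 = sum(c * (i - m0) ** 2 for i, c in enumerate(below)) / n0
        v1 = sum(c * (i + t - m1) ** 2 for i, c in enumerate(above)) / n1
        within = (n0 * v0 + n1 * v1) / total
        if within < best_within:
            best_within, best_t = within, t
    return best_t


# Hypothetical pixel values: a dark object cluster and a bright background cluster.
sample = [18, 20, 22, 25, 247, 248, 249, 250, 251, 252]
t = otsu_threshold(sample)
objects = [p for p in sample if p < t]
```

Minimizing the within-class variance is equivalent to maximizing the between-class variance, which is why the method is known by both names.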

Roundness Ratio/Circularity

Roundness is defined as a condition of a surface of revolution, like a cylinder, cone, or sphere, where all points of the surface intersected by any plane perpendicular to a common axis (in the case of a cylinder and cone) are equidistant from the axis. Since the axis and centre do not exist physically, measurements have to be made with reference to the surfaces of the figures of revolution only. To measure roundness, only the circularity of the contour is determined[12].
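The text does not give a formula for the roundness ratio. A common definition of circularity, assumed here, is 4πA/P², where A is the area and P the perimeter of the object; it equals 1 for a perfect circle and is smaller for other shapes. The Python sketch below uses this assumed definition, and the template values and the helper `closest_shape` are hypothetical.

```python
import math


def circularity(area, perimeter):
    """Assumed roundness ratio: 4*pi*A / P**2 (1.0 for a perfect circle)."""
    return 4 * math.pi * area / perimeter ** 2


# Hypothetical template ratios derived from ideal shapes.
TEMPLATE_RATIOS = {
    "circle": 1.0,
    "square": math.pi / 4,                   # side s: A = s^2, P = 4s
    "triangle": math.pi * math.sqrt(3) / 9,  # equilateral: A = sqrt(3)/4 * s^2, P = 3s
}


def closest_shape(ratio):
    """Return the template whose ratio is nearest to the measured ratio."""
    return min(TEMPLATE_RATIOS, key=lambda name: abs(TEMPLATE_RATIOS[name] - ratio))


r = 5.0
circle_ratio = circularity(math.pi * r ** 2, 2 * math.pi * r)
square_ratio = circularity(4.0 * 4.0, 4 * 4.0)
print(closest_shape(circle_ratio), closest_shape(square_ratio))
```

Because the ratio depends only on area and the square of the perimeter, it does not change when the object is scaled, moved, or rotated, which matches the invariance properties listed for the roundness ratio.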

Template Matching

The classical template matching method is characterized by a simple mechanism and high detection accuracy, and is used as a general model for analysis and error estimation. It therefore plays an important role in image processing and is widely used in object detection and recognition. It is a technique for finding small parts of an image that match a database image[14].
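A brute-force form of template matching can be sketched in Python as below. The text does not specify a similarity measure; this sketch assumes the sum of squared differences (SSD), where a lower score means a better match. The function name and the sample data are illustrative.

```python
def match_template(image, template):
    """Slide `template` over `image`; return ((row, col), score) of the lowest SSD."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_pos, best_score = None, float("inf")
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            ssd = sum(
                (image[y + j][x + i] - template[j][i]) ** 2
                for j in range(th)
                for i in range(tw)
            )
            if ssd < best_score:
                best_pos, best_score = (y, x), ssd
    return best_pos, best_score


# Hypothetical image containing the 2x2 template at row 1, column 2.
scene = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 1, 0],
    [0, 0, 1, 9, 0],
    [0, 0, 0, 0, 0],
]
patch = [
    [9, 1],
    [1, 9],
]
position, score = match_template(scene, patch)
```

The nested loops visit every candidate position, which illustrates why search cost is the main factor limiting the speed of template matching.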

K-Nearest Neighbour (KNN)

K-Nearest Neighbour (KNN) belongs to the family of simple classification and regression algorithms. It can be described as a lazy method: it does not use the training data points to do any generalization. Although classification remains the primary application of KNN, it can also be used for density estimation. Since KNN is non-parametric, it can do estimation for arbitrary distributions[19].
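A minimal KNN classifier can be sketched in Python as follows. The feature vectors (a roundness ratio and an aspect ratio), the training values, and the choice of Euclidean distance with k = 3 are assumptions made for illustration and are not taken from the project.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Return the majority label among the k training points nearest to `query`.

    `training` is a list of (feature_tuple, label) pairs. All work happens at
    query time, which is what makes KNN a lazy method.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (roundness ratio, aspect ratio) -> shape label.
training_set = [
    ((0.98, 1.00), "circle"),
    ((0.95, 1.05), "circle"),
    ((0.99, 0.97), "circle"),
    ((0.78, 1.00), "square"),
    ((0.80, 1.02), "square"),
    ((0.76, 0.98), "square"),
    ((0.60, 1.10), "triangle"),
    ((0.62, 1.15), "triangle"),
]

print(knn_classify(training_set, (0.79, 1.01)))
print(knn_classify(training_set, (0.97, 1.00)))
```

Since the whole training set is scanned for every query, run time grows with the size of the training set, which is the weakness noted for KNN in the literature review.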

Project Specification

This project is divided into three main parts: hardware, software, and estimated project cost.


The hardware used for this project is the Logitech HD Webcam C310. The basic system requirements of the webcam are listed below:


Figure 3: Logitech HD Webcam C310[20]

Windows Vista, Windows 7 (32-bit or 64-bit) or Windows 8

1 GHz

512 MB RAM or more

200MB hard drive space

Internet connection

USB 1.1 port (2.0 recommended)


The software used for this project is MATLAB, for image recognition and sound generation.

Project Estimated Cost

The estimated cost of this project is RM89, the price of the Logitech HD Webcam C310, because the project is essentially software based and the software used is MATLAB from the college engineering laboratory.

Gantt Chart
