In this series of articles, Part 1 covered voice commands, Part 2 covered voice dictation, and Part 3 looked at their controls. Part 4 covers making your application repeat what was dictated. There are two controls that can be added for this: the Voice Text control and the Direct Speech Synthesis control. I will cover the Direct Speech Synthesis (Direct SS) control because the Voice Text control is a simplified version of it.
As in the previous articles, I will only cover the basics, but give you the tools to do more.
What is required to do speech synthesis? Most systems already have the core synthesis engine loaded and do not require the SDK from Microsoft. However, as far as I can tell, installing the SDK adds to the functionality available.
A Note on the Voice Text Control: It is included in Windows 2K and XP, and also in some versions of Visual Basic. The operation of this control is similar to what is described below. Some of the properties and events have different names but use the same or similar variables. If you're interested in using the Voice Text control, the same basics as explained here will work. But remember that some properties, like Pitch and VolumeLeft/VolumeRight, and events, like WordPosition, do not exist in the Voice Text control and are available only in the Direct SS control. Methods like AudioPause, AudioResume, and AudioReset are simply called Pause, Resume, and Reset, respectively, but operate exactly the same. Events like Visual do not have the timehi and timelo timestamp variables.
How do you use the Direct SS control? Add the Microsoft Direct Text-to-Speech component to your project. If you leave the Visible property as True, the lips will synchronise to the text spoken, although personally I prefer to set it to False.
The control does not require any initialisation or other settings to operate. To have the control read out some text, you simply call the Speak method with the text you want spoken.
DirectSS1.Speak Text1.Text
When called, the method immediately returns control to your application while the control speaks the words in Text1.Text. This is very important to know: if you have several pieces of text you want spoken, you can queue them one after another and simply respond to the events of the SS control.
It is also advisable to disable voice dictation and voice commands while the Speak method is in use; otherwise, you may get unpredictable results.
After the control has completed speaking, it will call the AudioStop event. In this event, you can pick up that the control has completed its task and take the necessary actions, such as re-enabling the speak command button, moving on to the next page, and re-enabling voice dictation.
Private Sub DirectSS1_AudioStop(ByVal hi As Long, ByVal lo As Long)
    ' Speech has finished, so allow the user to start it again
    Command1.Enabled = True
End Sub
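The "move on to the next page" idea mentioned above could be sketched roughly as follows. This is only an illustration that replaces the simple handler: the page array, the mNext counter, the SpeakNextPage helper, and splitting Text1 on blank lines are all hypothetical choices, and it assumes AudioStop fires once after each spoken page.

```vb
Option Explicit

' Hypothetical state for this sketch: the pages to read and the next index.
Private mPages() As String
Private mNext As Long

Private Sub Command1_Click()
    ' Assumption: pages are separated by a blank line in Text1.
    mPages = Split(Text1.Text, vbCrLf & vbCrLf)
    mNext = 0
    Command1.Enabled = False
    SpeakNextPage
End Sub

' Speaks the next non-empty page, or re-enables the button when none are left.
Private Sub SpeakNextPage()
    Do While mNext <= UBound(mPages)
        If Len(Trim$(mPages(mNext))) > 0 Then
            DirectSS1.Speak mPages(mNext)
            mNext = mNext + 1
            Exit Sub
        End If
        mNext = mNext + 1   ' skip empty pages
    Loop
    Command1.Enabled = True
End Sub

Private Sub DirectSS1_AudioStop(ByVal hi As Long, ByVal lo As Long)
    ' The previous page has finished playing, so start the next one.
    SpeakNextPage
End Sub
```

Driving the sequence from AudioStop, rather than queuing every page at once, leaves room to update the display or stop early between pages.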
While the control is busy, it will call the WordPosition event every time it starts a new word. You can use this event to synchronise the displaying of the words with the audio output.
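Private Sub DirectSS1_WordPosition(ByVal hi As Long, _
                                   ByVal lo As Long, _
                                   ByVal byteoffset As Long)
    ' Convert the byte offset to a character position (two bytes per character)
    Text1.SelStart = byteoffset \ 2
End Sub
Note: The byteoffset value counts bytes in Unicode text, not ASCII characters. Each character takes two bytes, so divide the value by two before using it to set a position in a text box.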
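To highlight the whole word rather than just move the caret, the handler could also set SelLength. The following is a minimal sketch, not part of the original project: WordLengthAt is a hypothetical helper, and it assumes the text passed to Speak is exactly the content of Text1 and that words are separated by spaces, tabs, or line breaks.

```vb
Option Explicit

' Hypothetical helper: length of the word that starts at the 0-based
' character position startPos in source, up to the next whitespace.
Private Function WordLengthAt(ByVal source As String, _
                              ByVal startPos As Long) As Long
    Dim i As Long
    Dim ch As String

    i = startPos + 1                      ' Mid$ uses 1-based positions
    Do While i <= Len(source)
        ch = Mid$(source, i, 1)
        If ch = " " Or ch = vbTab Or ch = vbCr Or ch = vbLf Then Exit Do
        i = i + 1
    Loop
    WordLengthAt = i - (startPos + 1)
End Function

Private Sub DirectSS1_WordPosition(ByVal hi As Long, _
                                   ByVal lo As Long, _
                                   ByVal byteoffset As Long)
    Dim charPos As Long

    charPos = byteoffset \ 2              ' two bytes per Unicode character
    Text1.SelStart = charPos
    Text1.SelLength = WordLengthAt(Text1.Text, charPos)
End Sub
```

For the selection to stay visible while the text box does not have focus, its HideSelection property needs to be set to False at design time.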
On the next page is a detailed list of the control's properties and methods.
In the Project download is a small VB application that uses the Direct SS control to repeat the text that has been dictated. It's a simplified version of the Dictapad that is installed with the SDK, but it shows the simple inclusion of the Direct Speech Synthesiser control into any application.
Here are the properties for the Direct Speech Synthesiser control.
Age (index As Long) As Long (Read only)
Approximate age of the voice. This property can be one of the following values:
TTSAGE_BABY: ~1 year old.
TTSAGE_TODDLER: ~3 years old.
TTSAGE_CHILD: ~6 years old.
TTSAGE_ADOLESCENT: ~14 years old.
TTSAGE_ADULT: 20–60 years old.
TTSAGE_ELDERLY: Over 60 years old.
CallBacksEnabled As Integer
Uses the True, False settings like the voice command control. When CallBacksEnabled is set to False, events are not called.
CountEngines As Long (Read only)
Number of speech synthesis voices installed on this computer.
Note: This is the highest number that can be used as an index to indexed properties and methods.
CurrentMode As Long
Index for the currently selected voice.
Dialect (index As Long) As String (Read Only)
Dialect specific to the language.
Features (index As Long) As Long (Read Only)
Text-to-speech features that are available in the control. Valid settings (can be more than one):
TTSFEATURE_ANYWORD: The speech engine will attempt to read all words
TTSFEATURE_PCOPTIMIZED: The voice is optimised for use with computer speakers
TTSFEATURE_PHONEOPTIMIZED: The voice is optimised for use over the telephone with an 8 kHz sampling rate
TTSFEATURE_TAGGED: The engine can interpret Tagged text to control the voice output
TTSFEATURE_VISUAL: The engine can provide mouth position information
FileName As String
When this property is set to a filename, subsequent text-to-speech output is recorded to that file instead of being played on the wave device. To re-enable the speakers and disable recording to a file, set FileName to "".
Gender (index As Long) As Long (Read only)
Gender of the voice.
HWnd as Long (Read only)
Initialized As Integer
Returns or sets the initialised state of the control. Most methods and properties will automatically initialise the control. Valid Settings: 0 = Not initialised, 1 = Initialised.
JawOpen As Integer
Angle to which the jaw is open. This is a linear range from &HFF for completely open to &H00 for completely closed.
LanguageID (index As Long) As Long (Read only)
Bits 0 through 9 identify the primary language, such as English, French, Spanish, and so on. Bits 10 through 15 indicate the sublanguage, which is essentially a locale setting.
LastError As Long (Read only)
Result code from the last method or property call.
LastWordPosition As Long
Offset, in bytes, from the beginning of the text-to-speech buffer to the word that is currently being played.
Note: This is the byte offset in Unicode.
LipTension As Integer
Lip tension. This is a linear range from &HFF if the lips are very tense to &H00 if they are completely relaxed.
LipType As Integer
When set to 0, red female lips are drawn. When set to 1, male pink lips are drawn.
MaxPitch As Long (Read only)
Maximum legal value for Pitch.
MaxSpeed As Long (Read only)
Maximum legal value for Speed.
MaxVolumeLeft As Long (Read only)
Maximum legal value for VolumeLeft.
MaxVolumeRight As Long (Read only)
Maximum legal value for VolumeRight.
MfgName (index As Long) As String (Read only)
Name of the engine manufacturer.
MinPitch As Long (Read only)
Minimum legal value for Pitch.
MinSpeed As Long (Read only)
Minimum legal value for Speed.
MinVolumeLeft As Long (Read only)
Minimum legal value for VolumeLeft.
MinVolumeRight As Long (Read only)
Minimum legal value for VolumeRight.
ModeName (index As Long) As String (Read Only)
Name of the text-to-speech mode.
MouthEnabled As Integer
When MouthEnabled is set to 0, the mouth does not animate.
MouthUpturn As Integer
Extent to which the mouth turns up at the corners (that is, how much it smiles). This is a linear range from &HFF for the maximum upturn (that is, the mouth is fully smiling) to &H00 if the corners of the mouth turn down. If this member is &H80, the mouth is neutral.
Pitch As Long
The current baseline pitch, in hertz, for a text-to-speech mode. The actual pitch of the voice typically fluctuates above this baseline. It usually does not go below it.
Speaker (index As Long) As String (Read only)
Name of the voice.
Speaking As Integer
Returns whether or not the synthesizer voice is speaking. A value of 1 means the synthesizer voice is speaking; a value of 0 means it is not.
Speed As Long
Sets or Returns the average speed for a text-to-speech mode, in words per minute.
SuppressExceptions As Integer
When set to 1, exceptions will never occur. You must check LastError to get the error code of the last method or property invocation.
TeethLowerVisible As Integer
Extent to which the lower teeth are visible. This is a linear range from &HFF for the maximum extent (that is, the lower teeth and gums are completely exposed) to &H00 for the minimum (the lower teeth are completely hidden.) If this member is &H80, only the teeth are visible.
TeethUpperVisible As Integer
Extent to which the upper teeth are visible. This is a linear range from &HFF for the maximum extent (that is, the upper teeth and gums are completely exposed) to &H00 for the minimum (the upper teeth are completely hidden). If this member is &H80, only the teeth are visible.
TonguePosn As Integer
Tongue position. This is a linear range from &HFF if the tongue is against the upper teeth, to &H00 if it is relaxed. If this member is &H80, the tongue is visible.
VolumeLeft As Long
Sets or returns the current volume for the left channel of text-to-speech mode.
VolumeRight As Long
Sets or returns the current volume for the right channel of text-to-speech mode.
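As a rough illustration of how the indexed properties fit together, the sketch below lists the installed voices, decodes LanguageID using the bit layout described above, and keeps Speed inside its legal range. It is a sketch under assumptions: List1 is a hypothetical ListBox on the form, the helper names are made up for this example, and index values are assumed to run from 1 to CountEngines.

```vb
Option Explicit

' Bits 0 through 9 of a LanguageID value: the primary language.
Private Function PrimaryLanguage(ByVal langId As Long) As Long
    PrimaryLanguage = langId And &H3FF
End Function

' Bits 10 through 15 of a LanguageID value: the sublanguage.
Private Function SubLanguage(ByVal langId As Long) As Long
    SubLanguage = (langId \ &H400) And &H3F
End Function

' Fills the hypothetical List1 with one line per installed voice.
' Assumption: valid index values are 1 to CountEngines.
Private Sub ListVoices()
    Dim i As Long
    Dim langId As Long

    List1.Clear
    For i = 1 To DirectSS1.CountEngines
        langId = DirectSS1.LanguageID(i)
        List1.AddItem i & ": " & DirectSS1.ModeName(i) & _
                      " (language " & PrimaryLanguage(langId) & _
                      ", sublanguage " & SubLanguage(langId) & ")"
    Next i
End Sub

' Sets Speed, limited to the MinSpeed..MaxSpeed range of the current voice.
Private Sub SetSpeedClamped(ByVal wanted As Long)
    If wanted < DirectSS1.MinSpeed Then wanted = DirectSS1.MinSpeed
    If wanted > DirectSS1.MaxSpeed Then wanted = DirectSS1.MaxSpeed
    DirectSS1.Speed = wanted
End Sub
```

For example, a LanguageID of &H409 decodes to primary language 9 and sublanguage 1. The same clamping pattern would apply to Pitch, VolumeLeft, and VolumeRight with their own Min and Max properties.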
Here are the methods for the Direct Speech Synthesiser control.
AboutDlg (hWnd As Long, title As String)
Displays an About dialog box that identifies the text-to-speech engine and contains the copyright notice.
AudioPause
Pauses text-to-speech output.
Note: If an application calls AudioPause and then TextData, the data will be queued up so that when AudioResume is called, there will be no latency. Applications can use this to ensure that the text-to-speech engine will speak right away.
AudioReset
Stops speech and cancels all queued speech data. When the queue is empty, the engine calls the TextDataDone event.
AudioResume
Resumes text-to-speech output that has been paused.
GeneralDlg (hWnd As Long, title As String)
Displays a General dialog box that gives the user general control of the text-to-speech engine and gives the user access to engine-specific controls.
GetPronunciation (CharSet As Long, Text As Long , Sense As Long , Pronounce As String, PartOfSpeech As Long , EngineInfo As String)
Returns pronunciation information for Text in Sense, Pronounce, PartOfSpeech, and EngineInfo.
LexiconDlg (hWnd As Long, title As String)
Displays a dialog box that allows the speaker to view and edit his or her pronunciation. For example, the speaker can edit the phonetics of mispronounced words.
Select (index As Long)
Selects the text-to-speech engine that Speak will use. See the CountEngines property for more info.
Speak (text As String)
Causes the text-to-speech engine to speak the text. By default, the Microsoft female voice is played on the wave device. Speak is asynchronous; that is, the method returns before all of the text is played.
TextData (characterset As Long, flags As Long, text As String)
Starts the process of converting text into audio data to be spoken. This is the same as Speak, but lets you set more flags. TextData is asynchronous; that is, the method returns before all of the text is played.
TranslateDlg (hWnd As Long, title As String)
Displays a Translation dialog box that lets the user control symbols, currencies, abbreviations, and number-translation techniques.
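The pause, queue, and resume pattern from the AudioPause note could look something like the sketch below. It uses Speak instead of TextData to avoid guessing at flag values, on the assumption that Speak queues text in the same way. The button names cmdPrepare, cmdGo, and cmdStop are hypothetical.

```vb
Option Explicit

' Hypothetical "prepare" button: queue the text while output is paused.
Private Sub cmdPrepare_Click()
    DirectSS1.AudioPause            ' hold output before anything is queued
    DirectSS1.Speak Text1.Text      ' assumption: queued while paused
    cmdGo.Enabled = True
End Sub

' Hypothetical "go" button: the queued text should start without delay.
Private Sub cmdGo_Click()
    DirectSS1.AudioResume
    cmdGo.Enabled = False
End Sub

' Hypothetical "stop" button: cancel whatever is playing or still queued.
Private Sub cmdStop_Click()
    DirectSS1.AudioReset
End Sub
```

Splitting the work this way lets the application do the slower text preparation ahead of time and start the audio at the exact moment it is needed.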
And lastly, the Events.
AttribChanged (which_attribute As Long)
Is called when an engine attribute has changed. Valid values:
TTSNSAC_LANGUAGE: Indicates that the language has changed
TTSNSAC_MODE: Indicates that the mode or voice has changed
TTSNSAC_PITCH: Indicates that the baseline pitch for the voice has changed
TTSNSAC_SPEED: Indicates that the baseline average speed for the voice has changed
TTSNSAC_VOLUME: Indicates that the baseline volume for the voice has changed
AudioStart (hi As Long, lo As Long)
Is called when audio data starts playing.
AudioStop (hi As Long, lo As Long)
Is called when audio data stops playing.
ClickIn (x As Long, y As Long)
Is called when the user clicks in the object's icon.
TextDataDone (hi As Long, lo As Long, Flags As Long)
Is called when text to speech data processing ends. TextDataDone is called once per TextData call. The Flags return the reason the processing has ended.
TextDataStarted (hi As Long, lo As Long)
Is called when text to speech data processing begins. TextDataStarted is called once per TextData call.
Visual (timehi As Long, timelo As Long, Phoneme As Integer, EnginePhoneme As Integer, hints As Long, MouthHeight As Integer, bMouthWidth As Integer, bMouthUpturn As Integer, bJawOpen As Integer, TeethUpperVisible As Integer, TeethLowerVisible As Integer, TonguePosn As Integer, LipTension As Integer)
Is called whenever the shape of the mouth should change. Also notifies an application which phoneme is being used in the current digital-audio stream. This allows users to implement their own mouths.
WordPosition (hi As Long, lo As Long, byteoffset As Long)
Notifies the application of the word that is currently being played. Used for synchronization. An application can use the information returned by WordPosition to highlight the word being played.
About the Author
Richard Newcombe has been involved in computers since the time of the Commodore 64. Today, he excels in programming and design.
Richard is in his mid-30s and, if you are looking for him, look no further than his computer. He is always willing to help and give advice where he can on computer-related subjects.
At present he is working as a .NET 2008 Software Developer for Syncrony Web Services, South Africa.