Members
(static) AudioEncoding :number
The encoding of the audio data sent in the request.

All encodings support only 1 channel (mono) audio.

For best results, the audio source should be captured and transmitted using a lossless encoding (`FLAC` or `LINEAR16`). The accuracy of the speech recognition can be reduced if lossy codecs are used to capture or transmit audio, particularly if background noise is present. Lossy codecs include `MULAW`, `AMR`, `AMR_WB`, `OGG_OPUS`, and `SPEEX_WITH_HEADER_BYTE`.

The `FLAC` and `WAV` audio file formats include a header that describes the included audio content. You can request recognition for `WAV` files that contain either `LINEAR16` or `MULAW` encoded audio. If you send `FLAC` or `WAV` audio file format in your request, you do not need to specify an `AudioEncoding`; the audio encoding format is determined from the file header. If you specify an `AudioEncoding` when you send `FLAC` or `WAV` audio, the encoding configuration must match the encoding described in the audio header; otherwise the request returns a google.rpc.Code.INVALID_ARGUMENT error code.
Properties:
Properties:
Name | Type | Description |
---|---|---|
ENCODING_UNSPECIFIED | number | Not specified. |
LINEAR16 | number | Uncompressed 16-bit signed little-endian samples (Linear PCM). |
FLAC | number | |
MULAW | number | 8-bit samples that compand 14-bit audio samples using G.711 PCMU/mu-law. |
AMR | number | Adaptive Multi-Rate Narrowband codec. |
AMR_WB | number | Adaptive Multi-Rate Wideband codec. |
OGG_OPUS | number | Opus encoded audio frames in Ogg container (OggOpus). |
SPEEX_WITH_HEADER_BYTE | number | Although the use of lossy encodings is not recommended, if a very low bitrate encoding is required, |
MP3 | number | MP3 audio. Supports all standard MP3 bitrates (which range from 32-320 kbps). When using this encoding, |
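
To illustrate the header-driven behavior described above, here is a minimal sketch using the Node.js client that omits `encoding` and `sampleRateHertz` for a FLAC file and lets the file header supply them (the file path is a placeholder):

```js
// Sketch: FLAC/WAV carry a header, so encoding can be omitted from the config.
const fs = require('fs');
const speech = require('@google-cloud/speech').v1p1beta1;

async function recognizeFlac() {
  const client = new speech.SpeechClient();
  const request = {
    config: {
      // No `encoding` or `sampleRateHertz`: both are read from the FLAC header.
      languageCode: 'en-US',
    },
    audio: {content: fs.readFileSync('audio.flac').toString('base64')},
  };
  const [response] = await client.recognize(request);
  console.log(response.results[0].alternatives[0].transcript);
}
```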
(static) InteractionType :number
Use case categories that the audio recognition request can be described by.
Properties:
Name | Type | Description |
---|---|---|
INTERACTION_TYPE_UNSPECIFIED | number | Use case is either unknown or is something other than one of the other values below. |
DISCUSSION | number | Multiple people in a conversation or discussion. For example, in a meeting with two or more people actively participating. Typically all the primary people speaking would be in the same room (if not, see PHONE_CALL). |
PRESENTATION | number | One or more persons lecturing or presenting to others, mostly uninterrupted. |
PHONE_CALL | number | A phone call or video conference in which two or more people, who are not in the same room, are actively participating. |
VOICEMAIL | number | A recorded message intended for another person to listen to. |
PROFESSIONALLY_PRODUCED | number | Professionally produced audio (e.g., a TV show or podcast). |
VOICE_SEARCH | number | Transcribe spoken questions and queries into text. |
VOICE_COMMAND | number | Transcribe voice commands, such as for controlling a device. |
DICTATION | number | Transcribe speech to text to create a written document, such as a text message, email, or report. |
(static) MicrophoneDistance :number
Enumerates the types of capture settings describing an audio file.
Properties:
Name | Type | Description |
---|---|---|
MICROPHONE_DISTANCE_UNSPECIFIED | number | Audio type is not known. |
NEARFIELD | number | The audio was captured from a closely placed microphone, e.g., a phone, dictaphone, or handheld microphone. Generally, the speaker is within 1 meter of the microphone. |
MIDFIELD | number | The speaker is within 3 meters of the microphone. |
FARFIELD | number | The speaker is more than 3 meters away from the microphone. |
(static) OriginalMediaType :number
The original media the speech was recorded on.
Properties:
Name | Type | Description |
---|---|---|
ORIGINAL_MEDIA_TYPE_UNSPECIFIED | number | Unknown original media type. |
AUDIO | number | The speech data is an audio recording. |
VIDEO | number | The speech data was originally recorded on video. |
(static) RecordingDeviceType :number
The type of device the speech was recorded with.
Properties:
Name | Type | Description |
---|---|---|
RECORDING_DEVICE_TYPE_UNSPECIFIED | number | The recording device is unknown. |
SMARTPHONE | number | Speech was recorded on a smartphone. |
PC | number | Speech was recorded using a personal computer or tablet. |
PHONE_LINE | number | Speech was recorded over a phone line. |
VEHICLE | number | Speech was recorded in a vehicle. |
OTHER_OUTDOOR_DEVICE | number | Speech was recorded outdoors. |
OTHER_INDOOR_DEVICE | number | Speech was recorded indoors. |
(static) SpeechEventType :number
Indicates the type of speech event.
Properties:
Name | Type | Description |
---|---|---|
SPEECH_EVENT_UNSPECIFIED | number | No speech event specified. |
END_OF_SINGLE_UTTERANCE | number | This event indicates that the server has detected the end of the user's speech utterance and expects no additional speech. Therefore, the server will not process additional audio (although it may subsequently return additional results). The client should stop sending additional audio data, half-close the gRPC connection, and wait for any additional results until the server closes the gRPC connection. This event is only sent if `single_utterance` was set to `true`. |
Type Definitions
LongRunningRecognizeMetadata
Describes the progress of a long-running `LongRunningRecognize` call. It is included in the `metadata` field of the `Operation` returned by the `GetOperation` call of the `google::longrunning::Operations` service.
Properties:
Name | Type | Description |
---|---|---|
progressPercent | number | Approximate percentage of audio processed thus far. Guaranteed to be 100 when the audio is fully processed and the results are available. |
startTime | Object | Time when the request was received. This object should have the same structure as Timestamp. |
lastUpdateTime | Object | Time of the most recent processing update. This object should have the same structure as Timestamp. |
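
A sketch of observing this metadata from the Node.js client. It assumes the google-gax Operation returned by `longRunningRecognize` emits 'progress' events carrying the decoded metadata; that is a property of the client wrapper rather than of this message, so treat it as an assumption:

```js
// Sketch: observing LongRunningRecognizeMetadata while an operation runs.
const speech = require('@google-cloud/speech').v1p1beta1;

async function transcribeWithProgress(request) {
  const client = new speech.SpeechClient();
  const [operation] = await client.longRunningRecognize(request);

  operation.on('progress', metadata => {
    // `metadata` has the LongRunningRecognizeMetadata shape described above.
    console.log(`processed: ${metadata.progressPercent}%`);
  });

  const [response] = await operation.promise();  // LongRunningRecognizeResponse
  return response;
}
```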
LongRunningRecognizeRequest
The top-level message sent by the client for the `LongRunningRecognize` method.
Properties:
Name | Type | Description |
---|---|---|
config | Object | Required. Provides information to the recognizer that specifies how to process the request. This object should have the same structure as RecognitionConfig. |
audio | Object | Required. The audio data to be recognized. This object should have the same structure as RecognitionAudio. |
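
For long audio, the request is typically built with a Cloud Storage URI rather than inline content. A minimal sketch using the Node.js client (bucket and object names are placeholders):

```js
// Sketch: LongRunningRecognizeRequest with config + audio (uri form).
const speech = require('@google-cloud/speech').v1p1beta1;

async function transcribeGcsFile() {
  const client = new speech.SpeechClient();
  const request = {
    config: {encoding: 'FLAC', languageCode: 'en-US'},
    audio: {uri: 'gs://my-bucket/long-recording.flac'},  // placeholder URI
  };
  const [operation] = await client.longRunningRecognize(request);
  const [response] = await operation.promise();  // LongRunningRecognizeResponse
  for (const result of response.results) {
    console.log(result.alternatives[0].transcript);
  }
}
```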
LongRunningRecognizeResponse
The only message returned to the client by the `LongRunningRecognize` method. It contains the result as zero or more sequential `SpeechRecognitionResult` messages. It is included in the `result.response` field of the `Operation` returned by the `GetOperation` call of the `google::longrunning::Operations` service.
Properties:
Name | Type | Description |
---|---|---|
results | Array.<Object> | Output only. Sequential list of transcription results corresponding to sequential portions of audio. This object should have the same structure as SpeechRecognitionResult. |
RecognitionAudio
Contains audio data in the encoding specified in the `RecognitionConfig`. Either `content` or `uri` must be supplied. Supplying both or neither returns google.rpc.Code.INVALID_ARGUMENT. See content limits.
Properties:
Name | Type | Description |
---|---|---|
content | Buffer | The audio data bytes encoded as specified in `RecognitionConfig`. |
uri | string | URI that points to a file that contains audio data bytes as specified in `RecognitionConfig`. |
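
The two mutually exclusive forms look like this (paths and bucket names are placeholders; supplying both in one RecognitionAudio would fail as described above):

```js
// Sketch: the two RecognitionAudio forms.
const fs = require('fs');

// Inline bytes: suitable for short audio.
const inlineAudio = {content: fs.readFileSync('clip.wav').toString('base64')};

// Cloud Storage reference: suitable for longer audio.
const storedAudio = {uri: 'gs://my-bucket/clip.wav'};
```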
RecognitionConfig
Provides information to the recognizer that specifies how to process the request.
Properties:
Name | Type | Description |
---|---|---|
encoding | number | Encoding of audio data sent in all `RecognitionAudio` messages. The number should be among the values of AudioEncoding. |
sampleRateHertz | number | Sample rate in Hertz of the audio data sent in all `RecognitionAudio` messages. |
audioChannelCount | number | Optional. The number of channels in the input audio data. ONLY set this for MULTI-CHANNEL recognition. Valid values for LINEAR16 and FLAC are |
enableSeparateRecognitionPerChannel | boolean | This needs to be set to 'true' explicitly and |
languageCode | string | Required. The language of the supplied audio as a BCP-47 language tag. Example: "en-US". See Language Support for a list of the currently supported language codes. |
alternativeLanguageCodes | Array.<string> | Optional. A list of up to 3 additional BCP-47 language tags, listing possible alternative languages of the supplied audio. See Language Support for a list of the currently supported language codes. If alternative languages are listed, the recognition result will contain recognition in the most likely language detected, including the main language_code. The recognition result will include the language tag of the language detected in the audio. Note: This feature is only supported for Voice Command and Voice Search use cases, and performance may vary for other use cases (e.g., phone call transcription). |
maxAlternatives | number | Optional. Maximum number of recognition hypotheses to be returned. Specifically, the maximum number of |
profanityFilter | boolean | Optional. If set to |
speechContexts | Array.<Object> | Optional. Array of SpeechContext. A means to provide context to assist the speech recognition. For more information, see Phrase Hints. This object should have the same structure as SpeechContext. |
enableWordTimeOffsets | boolean | Optional. If |
enableWordConfidence | boolean | Optional. If |
enableAutomaticPunctuation | boolean | Optional. If 'true', adds punctuation to recognition result hypotheses. This feature is only available in select languages. Setting this for requests in other languages has no effect at all. The default 'false' value does not add punctuation to result hypotheses. Note: This is currently offered as an experimental service, complimentary to all users. In the future this may be exclusively available as a premium feature. |
enableSpeakerDiarization | boolean | Optional. If 'true', enables speaker detection for each recognized word in the top alternative of the recognition result, using a speaker_tag provided in the WordInfo. Note: Use diarization_config instead. |
diarizationSpeakerCount | number | Optional. If set, specifies the estimated number of speakers in the conversation. Defaults to '2'. Ignored unless enable_speaker_diarization is set to true. Note: Use diarization_config instead. |
diarizationConfig | Object | Optional. Config to enable speaker diarization and set additional parameters to make diarization better suited for your application. Note: When this is enabled, we send all the words from the beginning of the audio for the top alternative in every consecutive STREAMING response. This is done in order to improve our speaker tags, as our models learn to identify the speakers in the conversation over time. For non-streaming requests, the diarization results will be provided only in the top alternative of the FINAL SpeechRecognitionResult. This object should have the same structure as SpeakerDiarizationConfig. |
metadata | Object | Optional. Metadata regarding this request. This object should have the same structure as RecognitionMetadata. |
model | string | Optional. Which model to select for the given request. Select the model best suited to your domain to get the best results. If a model is not explicitly specified, a model is auto-selected based on the parameters in the RecognitionConfig. |
useEnhanced | boolean | Optional. Set to true to use an enhanced model for speech recognition. If |
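
As a sketch, a RecognitionConfig that exercises several of the optional fields above might look like the following; enum-valued fields are written as string names (which the protobuf layer used by the Node.js client accepts in requests), and the specific values are illustrative rather than recommendations:

```js
// Sketch: a RecognitionConfig combining several optional fields.
const config = {
  encoding: 'LINEAR16',
  sampleRateHertz: 16000,
  languageCode: 'en-US',
  maxAlternatives: 2,
  enableAutomaticPunctuation: true,
  enableWordTimeOffsets: true,
  speechContexts: [{phrases: ['Cloud Speech', 'v1p1beta1']}],
  model: 'phone_call',   // model names are documented separately; illustrative here
  useEnhanced: true,     // requests the enhanced variant of the selected model
};
```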
RecognitionMetadata
Description of audio data to be recognized.
Properties:
Name | Type | Description |
---|---|---|
interactionType | number | The use case most closely describing the audio content to be recognized. The number should be among the values of InteractionType. |
industryNaicsCodeOfAudio | number | The industry vertical to which this speech recognition request most closely applies. This is most indicative of the topics contained in the audio. Use the 6-digit NAICS code to identify the industry vertical - see https://www.naics.com/search/. |
microphoneDistance | number | The audio type that most closely describes the audio being recognized. The number should be among the values of MicrophoneDistance. |
originalMediaType | number | The original media the speech was recorded on. The number should be among the values of OriginalMediaType. |
recordingDeviceType | number | The type of device the speech was recorded with. The number should be among the values of RecordingDeviceType. |
recordingDeviceName | string | The device used to make the recording. Examples: 'Nexus 5X', 'Polycom SoundStation IP 6000', 'POTS', 'VoIP', or 'Cardioid Microphone'. |
originalMimeType | string | Mime type of the original audio file. For example |
obfuscatedId | number | Obfuscated (privacy-protected) ID of the user, to identify the number of unique users using the service. |
audioTopic | string | Description of the content, e.g., "Recordings of federal supreme court hearings from 2012". |
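
As a sketch, metadata for a recorded customer phone call might look like this (string enum names are used for the enum-valued fields; the NAICS code, device name, and topic are purely illustrative):

```js
// Sketch: RecognitionMetadata describing a recorded phone call.
const metadata = {
  interactionType: 'PHONE_CALL',
  industryNaicsCodeOfAudio: 518210,     // illustrative NAICS code
  microphoneDistance: 'NEARFIELD',
  originalMediaType: 'AUDIO',
  recordingDeviceType: 'PHONE_LINE',
  recordingDeviceName: 'POTS',
  originalMimeType: 'audio/wav',
  audioTopic: 'customer support call',  // illustrative description
};
```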
RecognizeRequest
The top-level message sent by the client for the `Recognize` method.
Properties:
Name | Type | Description |
---|---|---|
config | Object | Required. Provides information to the recognizer that specifies how to process the request. This object should have the same structure as RecognitionConfig. |
audio | Object | Required. The audio data to be recognized. This object should have the same structure as RecognitionAudio. |
RecognizeResponse
The only message returned to the client by the `Recognize` method. It contains the result as zero or more sequential `SpeechRecognitionResult` messages.
Properties:
Name | Type | Description |
---|---|---|
results | Array.<Object> | Output only. Sequential list of transcription results corresponding to sequential portions of audio. This object should have the same structure as SpeechRecognitionResult. |
SpeakerDiarizationConfig
Optional. Config to enable speaker diarization.
Properties:
Name | Type | Description |
---|---|---|
enableSpeakerDiarization | boolean | Optional. If 'true', enables speaker detection for each recognized word in the top alternative of the recognition result, using a speaker_tag provided in the WordInfo. |
minSpeakerCount | number | Optional. Minimum number of speakers in the conversation. This range gives you more flexibility by allowing the system to automatically determine the correct number of speakers. If not set, the default value is 2. |
maxSpeakerCount | number | Optional. Maximum number of speakers in the conversation. This range gives you more flexibility by allowing the system to automatically determine the correct number of speakers. If not set, the default value is 6. |
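
A sketch of enabling diarization through diarizationConfig and reading per-word speaker tags from the top alternative of the final result (the speaker-count values are illustrative):

```js
// Sketch: speaker diarization via diarizationConfig.
const speech = require('@google-cloud/speech').v1p1beta1;

async function diarize(request) {
  const client = new speech.SpeechClient();
  request.config.diarizationConfig = {
    enableSpeakerDiarization: true,
    minSpeakerCount: 2,
    maxSpeakerCount: 4,
  };
  const [response] = await client.recognize(request);
  // Speaker tags are carried on the words of the top alternative of the final result.
  const lastResult = response.results[response.results.length - 1];
  for (const info of lastResult.alternatives[0].words) {
    console.log(`speaker ${info.speakerTag}: ${info.word}`);
  }
}
```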
SpeechContext
Provides "hints" to the speech recognizer to favor specific words and phrases in the results.
Properties:
Name | Type | Description |
---|---|---|
phrases | Array.<string> | Optional. A list of strings containing word and phrase "hints" so that the speech recognition is more likely to recognize them. This can be used to improve the accuracy for specific words and phrases, for example, if specific commands are typically spoken by the user. This can also be used to add additional words to the vocabulary of the recognizer. See usage limits. List items can also be set to classes for groups of words that represent common concepts that occur in natural language. For example, rather than providing phrase hints for every month of the year, using the $MONTH class improves the likelihood of correctly transcribing audio that includes months. |
boost | number | Hint Boost. A positive value will increase the probability that a specific phrase will be recognized over other similar-sounding phrases. The higher the boost, the higher the chance of false positive recognition as well. Negative boost values would correspond to anti-biasing. Anti-biasing is not enabled, so negative boost will simply be ignored. Though |
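
A sketch of supplying phrase hints, including a class token and an illustrative boost value:

```js
// Sketch: phrase hints with an optional boost.
const configWithHints = {
  languageCode: 'en-US',
  speechContexts: [
    {phrases: ['Kubernetes', 'v1p1beta1'], boost: 10.0},  // illustrative phrases/boost
    {phrases: ['$MONTH']},                                 // class token for months
  ],
};
```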
SpeechRecognitionAlternative
Alternative hypotheses (a.k.a. n-best list).
Properties:
Name | Type | Description |
---|---|---|
transcript | string | Output only. Transcript text representing the words that the user spoke. |
confidence | number | Output only. The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where `is_final=true`. |
words | Array.<Object> | Output only. A list of word-specific information for each recognized word. Note: When `enable_speaker_diarization` is true, you will see all the words from the beginning of the audio. This object should have the same structure as WordInfo. |
SpeechRecognitionResult
A speech recognition result corresponding to a portion of the audio.
Properties:
Name | Type | Description |
---|---|---|
alternatives | Array.<Object> | Output only. May contain one or more recognition hypotheses (up to the maximum specified in `max_alternatives`). This object should have the same structure as SpeechRecognitionAlternative. |
channelTag | number | For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from '1' to 'N'. |
languageCode | string | Output only. The BCP-47 language tag of the language in this result. This language code was detected to have the most likelihood of being spoken in the audio. |
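
Putting the pieces together, a sketch of walking the results and alternatives of a recognize or long-running response:

```js
// Sketch: walking SpeechRecognitionResult / SpeechRecognitionAlternative.
function printResults(response) {
  for (const result of response.results) {
    const best = result.alternatives[0];  // most probable hypothesis
    console.log(
      `[channel ${result.channelTag || 1}] ` +
      `${best.transcript} (confidence: ${best.confidence})`
    );
  }
}
```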
StreamingRecognitionConfig
Provides information to the recognizer that specifies how to process the request.
Properties:
Name | Type | Description |
---|---|---|
config | Object | Required. Provides information to the recognizer that specifies how to process the request. This object should have the same structure as RecognitionConfig. |
singleUtterance | boolean | Optional. If |
interimResults | boolean | Optional. If |
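
The streaming configuration wraps a RecognitionConfig together with the two streaming-only flags; a minimal sketch:

```js
// Sketch: StreamingRecognitionConfig wrapping a RecognitionConfig.
const streamingConfig = {
  config: {
    encoding: 'LINEAR16',
    sampleRateHertz: 16000,
    languageCode: 'en-US',
  },
  interimResults: true,    // stream interim hypotheses as they stabilize
  singleUtterance: false,  // keep listening past the first detected utterance
};
```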
StreamingRecognitionResult
A streaming speech recognition result corresponding to a portion of the audio that is currently being processed.
Properties:
Name | Type | Description |
---|---|---|
alternatives | Array.<Object> | Output only. May contain one or more recognition hypotheses (up to the maximum specified in `max_alternatives`). This object should have the same structure as SpeechRecognitionAlternative. |
isFinal | boolean | Output only. If |
stability | number | Output only. An estimate of the likelihood that the recognizer will not change its guess about this interim result. Values range from 0.0 (completely unstable) to 1.0 (completely stable). This field is only provided for interim results (`is_final=false`). |
resultEndTime | Object | Output only. Time offset of the end of this result relative to the beginning of the audio. This object should have the same structure as Duration. |
channelTag | number | For multi-channel audio, this is the channel number corresponding to the recognized result for the audio from that channel. For audio_channel_count = N, its output values can range from '1' to 'N'. |
languageCode | string | Output only. The BCP-47 language tag of the language in this result. This language code was detected to have the most likelihood of being spoken in the audio. |
StreamingRecognizeRequest
The top-level message sent by the client for the `StreamingRecognize` method. Multiple `StreamingRecognizeRequest` messages are sent. The first message must contain a `streaming_config` message and must not contain audio data. All subsequent messages must contain audio data and must not contain a `streaming_config` message.
Properties:
Name | Type | Description |
---|---|---|
streamingConfig | Object | Provides information to the recognizer that specifies how to process the request. The first `StreamingRecognizeRequest` message must contain a `streaming_config` message. This object should have the same structure as StreamingRecognitionConfig. |
audioContent | Buffer | The audio data to be recognized. Sequential chunks of audio data are sent in sequential `StreamingRecognizeRequest` messages. |
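
With the Node.js client, the streamingRecognize helper takes care of sending the streaming_config message first and then splitting the piped audio into sequential audioContent messages; a sketch (the file name is a placeholder):

```js
// Sketch: the helper sends streamingConfig first, then audioContent chunks.
const fs = require('fs');
const speech = require('@google-cloud/speech').v1p1beta1;

const client = new speech.SpeechClient();
const recognizeStream = client
  .streamingRecognize({
    config: {encoding: 'LINEAR16', sampleRateHertz: 16000, languageCode: 'en-US'},
    interimResults: true,
  })
  .on('error', console.error)
  .on('data', response => {
    // Each `data` event is a StreamingRecognizeResponse.
    const result = response.results[0];
    if (result) console.log(result.alternatives[0].transcript);
  });

// Raw audio bytes are split into sequential audioContent messages internally.
fs.createReadStream('audio.raw').pipe(recognizeStream);
```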
StreamingRecognizeResponse
`StreamingRecognizeResponse` is the only message returned to the client by `StreamingRecognize`. A series of zero or more `StreamingRecognizeResponse` messages are streamed back to the client. If there is no recognizable audio, and `single_utterance` is set to false, then no messages are streamed back to the client.

Here's an example of a series of seven `StreamingRecognizeResponse`s that might be returned while processing audio:

1. results { alternatives { transcript: "tube" } stability: 0.01 }
2. results { alternatives { transcript: "to be a" } stability: 0.01 }
3. results { alternatives { transcript: "to be" } stability: 0.9 } results { alternatives { transcript: " or not to be" } stability: 0.01 }
4. results { alternatives { transcript: "to be or not to be" confidence: 0.92 } alternatives { transcript: "to bee or not to bee" } is_final: true }
5. results { alternatives { transcript: " that's" } stability: 0.01 }
6. results { alternatives { transcript: " that is" } stability: 0.9 } results { alternatives { transcript: " the question" } stability: 0.01 }
7. results { alternatives { transcript: " that is the question" confidence: 0.98 } alternatives { transcript: " that was the question" } is_final: true }

Notes:

- Only two of the above responses (#4 and #7) contain final results; they are indicated by is_final: true. Concatenating these together generates the full transcript: "to be or not to be that is the question".
- The others contain interim results. #3 and #6 contain two interim results: the first portion has a high stability and is less likely to change; the second portion has a low stability and is very likely to change. A UI designer might choose to show only high-stability results.
- The specific stability and confidence values shown above are only for illustrative purposes. Actual values may vary.
- In each response, only one of these fields will be set: error, speech_event_type, or one or more (repeated) results.
Properties:
Name | Type | Description |
---|---|---|
error | Object | Output only. If set, returns a google.rpc.Status message that specifies the error for the operation. This object should have the same structure as Status. |
results | Array.<Object> | Output only. This repeated list contains zero or more results that correspond to consecutive portions of the audio currently being processed. It contains zero or one `is_final=true` result (the newly settled portion), followed by zero or more `is_final=false` results (the interim results). This object should have the same structure as StreamingRecognitionResult. |
speechEventType | number | Output only. Indicates the type of speech event. The number should be among the values of SpeechEventType. |
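
Continuing the streaming sketch above, the notes translate into response handling roughly like this: concatenate only the is_final results, and optionally surface high-stability interim text:

```js
// Sketch: assemble the transcript from final results only.
let transcript = '';

recognizeStream.on('data', response => {
  for (const result of response.results) {
    if (result.isFinal) {
      transcript += result.alternatives[0].transcript;  // e.g. "to be or not to be"
    } else if (result.stability > 0.8) {
      // Optionally surface high-stability interim text in a UI.
      console.log('interim:', result.alternatives[0].transcript);
    }
  }
});

recognizeStream.on('end', () => console.log('full transcript:', transcript));
```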
WordInfo
Word-specific information for recognized words.
Properties:
Name | Type | Description |
---|---|---|
startTime | Object | Output only. Time offset relative to the beginning of the audio, and corresponding to the start of the spoken word. This field is only set if `enable_word_time_offsets=true`. This object should have the same structure as Duration. |
endTime | Object | Output only. Time offset relative to the beginning of the audio, and corresponding to the end of the spoken word. This field is only set if `enable_word_time_offsets=true`. This object should have the same structure as Duration. |
word | string | Output only. The word corresponding to this set of information. |
confidence | number | Output only. The confidence estimate between 0.0 and 1.0. A higher number indicates an estimated greater likelihood that the recognized words are correct. This field is set only for the top alternative of a non-streaming result, or of a streaming result where `is_final=true`. |
speakerTag | number | Output only. A distinct integer value is assigned for every speaker within the audio. This field specifies which one of those speakers was detected to have spoken this word. Values range from '1' to diarization_speaker_count. speaker_tag is set if enable_speaker_diarization = 'true' and only in the top alternative. |
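
Word offsets arrive as Duration objects with seconds and nanos fields; a sketch of printing them for the top alternative of a result (assumes enableWordTimeOffsets was set to true in the request config; depending on the runtime, seconds may arrive as a string or Long, which template literals render as text):

```js
// Sketch: print word-level timestamps from the top alternative of a result.
function printWordTimings(result) {
  for (const info of result.alternatives[0].words) {
    const start = `${info.startTime.seconds}.${(info.startTime.nanos || 0) / 1e6 | 0}s`;
    const end = `${info.endTime.seconds}.${(info.endTime.nanos || 0) / 1e6 | 0}s`;
    console.log(`${info.word}: ${start} - ${end}`);
  }
}
```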