Unverified Commit e800454d authored by Marcin Wątroba's avatar Marcin Wątroba

Change model and update pipeline

parent e7a1f7ac
1 merge request: !13 Change data model
asr-benchmarks==0.0.1-alpha.48
speechbrain
FROM python:3.9-slim
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update \
&& apt-get dist-upgrade -y \
&& apt-get install -y --no-install-recommends \
build-essential \
portaudio19-dev \
openssh-client \
python3-pip \
python3-dev \
&& apt-get clean \
&& rm -fr /var/lib/apt/lists/* \
&& rm -fr /var/cache/apt/*
ADD ./python /dictation_client
WORKDIR /dictation_client
RUN pip3 install -i https://pypi.clarin-pl.eu/simple -r requirements.txt
CMD ["./run_web_service.sh"]
version: "3.8"
services:
techmo_asr:
image: docker-registry.theliver.pl/techmo-asr:1.1
container_name: techmo_asr
environment:
- TECHMO_SSH_SERVER_USERNAME=mwatroba
- TECHMO_SSH_SERVER_URL=jankocon.clarin-pl.eu
- TECHMO_SERVER_SSH_PORT=9222
- TECHMO_REMOTE_SERVICE_PORT=12321
- TECHMO_SERVER_URL=156.17.135.34
- AUTH_TOKEN=test1234
volumes:
- /Users/marcinwatroba/Desktop/WUST/KEYS/techmo_asr_server:/keys
#!/bin/bash
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
docker build -t techmo-asr "$SCRIPT_DIR"
docker tag techmo-asr docker-registry.theliver.pl/techmo-asr:1.1
docker push docker-registry.theliver.pl/techmo-asr:1.1
# Python implementation of Dictation ASR gRPC client.
## Docker usage
### Build docker image
To prepare a Docker image with the Python implementation of the Dictation Client, open the project's main directory and run the following command:
```
docker build -f Dockerfile-python -t dictation-client-python:2.3.0 .
```
The build process takes several minutes.
When it is complete, you will see the following message:
```
Successfully tagged dictation-client-python:2.3.0
```
### Run Dictation client
To use the Dictation Client in a Docker container, go to the `dictation-client/python/docker` directory and run the `run_dictation_client_python.sh` script.
To send a simple request to the Dictation service, use:
```
./run_dictation_client_python.sh --service-address IP_ADDRESS:PORT --filename WAV_FILE_NAME
```
To print the list of available options, use:
```
./run_dictation_client_python.sh --help
```
Audio files to be transcribed should be placed inside the `dictation-client/python/docker/wav` directory.
If TLS is used, the credentials should be placed inside the `dictation-client/python/docker/tls` directory.
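For example, to send a request over a TLS-secured channel once the credential files are in place (address and file name are placeholders):
```
./run_dictation_client_python.sh --service-address IP_ADDRESS:PORT --filename WAV_FILE_NAME --tls
```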
## Local instance usage
### Basic Usage
The Dictation Client includes scripts for automatic environment configuration and launch on Debian-family Linux systems. To launch the Dictation Client on other Linux-based systems or on Windows, check out the "Manual Usage" section.
#### Before run
To install the required dependencies and prepare the virtual environment, run:
```
./setup.sh
```
#### Run
To run the Dictation Client, use the `run.sh` script, e.g.:
```
./run.sh --service-address IP_ADDRESS:PORT --wave-path INPUT_WAVE
```
To print the usage description, use:
```
./run.sh --help
```
### Manual Usage
#### Before run
##### Submodules
After cloning the git repository, download the submodules:
```
git submodule update --init --recursive
```
(This command has to be invoked from the project's root directory.)
If you are not using git, you have to download the `googleapis` submodule manually.
To do this, open the project repository in a web browser, go to the `submodules` directory and use the link located there to open the relevant commit of the googleapis repository. Then download it, unpack it and copy all files to the `submodules/googleapis` directory.
##### Dependencies
If you don't have virtualenv yet, install it first (https://virtualenv.pypa.io/en/stable/installation.html).
On Debian/Ubuntu it can be installed with the `setup.sh` script.
Then install the required dependencies inside the virtual environment (this step is only needed the first time; afterwards it is enough to reuse the existing virtual environment).
- On Linux:
Use Python 3 with a virtual environment and install the required packages (supported Python versions: 3.5, 3.6, 3.7, 3.8, 3.9):
```
virtualenv -p python3 .env
source .env/bin/activate
pip install -r requirements.txt
```
- On Windows 10:
Temporarily change PowerShell's execution policy to allow scripting. Start PowerShell with `Run as Administrator` and run:
```
Set-ExecutionPolicy RemoteSigned
```
then confirm your choice.
Use Python 3 with a virtual environment and install the required packages (supported Python versions: 3.5, 3.6, 3.7, 3.8, 3.9):
```
virtualenv -p python3 .env
.\.env\Scripts\activate
pip install -r requirements.txt
```
To switch PowerShell's execution policy back to the default, run:
```
Set-ExecutionPolicy Restricted
```
##### Proto sources
[Optional] To regenerate the sources from `.proto`, run:
```
./make_proto.sh
```
This might be required when using a different gRPC or Protocol Buffers version.
#### Run
To run the Dictation Client, activate the virtual environment first:
- On Linux:
```
source .env/bin/activate
```
- On Windows:
```
.\.env\Scripts\activate
```
Then run the Dictation Client. Sample use:
```
python dictation_client.py --service-address "192.168.1.1:4321" --wave-path audio.wav
```
For each request you have to provide the service address and an audio source (a wav file or the microphone).
## Usage:
```
Basic usage: dictation_client.py --service-address ADDRESS --wave-path WAVE
```
Available options:
```
-h, --help show this help message and exit
--service-address ADDRESS
IP address and port (address:port) of a service the
client will connect to.
--ssl-dir SSL_DIRECTORY
If set to a path with ssl credential files
(client.crt, client.key, ca.crt), use ssl
authentication. Otherwise use insecure channel
(default).
--wave-path WAVE Path to wave file with speech to be recognized. Should
be mono, 8kHz or 16kHz.
--mic Use microphone as an audio source (instead of wave
file).
--session-id SESSION_ID
Session ID to be passed to the service. If not
specified, the service will generate a default session
ID itself.
--grpc-timeout GRPC_TIMEOUT
Timeout in milliseconds used to set gRPC deadline -
how long the client is willing to wait for a reply
from the server. If not specified, the service will
set the deadline to a very large number.
--max-alternatives MAX_ALTERNATIVES
Maximum number of recognition hypotheses to be
returned.
--time-offsets If set - the recognizer will return also word time
offsets.
--single-utterance If set - the recognizer will detect a single spoken
utterance.
--interim-results If set - messages with temporal results will be shown.
--no-input-timeout NO_INPUT_TIMEOUT
MRCP v2 no input timeout [ms].
--speech-complete-timeout SPEECH_COMPLETE_TIMEOUT
MRCP v2 speech complete timeout [ms].
--speech-incomplete-timeout SPEECH_INCOMPLETE_TIMEOUT
MRCP v2 speech incomplete timeout [ms].
--recognition-timeout RECOGNITION_TIMEOUT
MRCP v2 recognition timeout [ms].
--context-phrase CONTEXT_PHRASE
Specifies which context model to use.
```
## Troubleshooting
### Dependencies
If the dependency installation fails with a message similar to this one:
```
src/_portaudiomodule.c:28:10: fatal error: Python.h: No such file or directory
#include "Python.h"
^~~~~~~~~~
compilation terminated.
error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
```
it means that the `python3-dev` package is missing.
On Debian/Ubuntu this package can be installed with the `setup.sh` script.
If the dependency installation fails with a message similar to this one:
```
src/_portaudiomodule.c:29:10: fatal error: portaudio.h: No such file or directory
#include "portaudio.h"
^~~~~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
```
it means that the PortAudio library is missing.
PortAudio can be downloaded from: http://www.portaudio.com/download.html
On Debian/Ubuntu this package can be installed with the `setup.sh` script.
### Microphone
To use a microphone as the audio source instead of a wav file, use the `--mic` option.
It sends audio data directly from the microphone, but it does not indicate when the recognition should finish.
For this reason, in most cases `--mic` should be combined with the `--single-utterance` option, which stops the recognition after the first detected utterance.
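For example (the service address below is a placeholder; use your own):
```
python dictation_client.py --service-address "192.168.1.1:4321" --mic --single-utterance
```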
If the only output you receive is:
```
Received speech event type: END_OF_SINGLE_UTTERANCE
```
check if your microphone is connected and properly configured.
### ALSA Configuration
On Linux systems using the Advanced Linux Sound Architecture (ALSA), minor configuration changes may be necessary before the first use.
If you get the following output after running a request:
```
Dictation ASR gRPC client 2.3.0
ALSA lib pcm_dsnoop.c:618:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_route.c:867:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_route.c:867:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_route.c:867:(find_matching_chmap) Found no matching channel map
```
that means you need to modify the audio interface configuration.
In that case, open the `/usr/share/alsa/alsa.conf` file with root privileges, e.g.:
```
sudo vim /usr/share/alsa/alsa.conf
```
In the `# PCM interface` section, find and comment out (using `#`) all lines that define the interfaces reported as 'Unknown':
```
pcm.rear cards.pcm.rear
pcm.center_lfe cards.pcm.center_lfe
pcm.side cards.pcm.side
```
To get rid of the warnings, also comment out several lines below, starting with `pcm.surround`.
Then save and close the file.
### FFmpeg
If the FFmpeg framework is not installed, the following warning appears in the program output:
```
RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
```
Installing the FFmpeg framework is not necessary to run the application, but it may be useful for anyone working with sound files.
FFmpeg can be downloaded from: https://ffmpeg.org/download.html
On Ubuntu/Debian you can install FFmpeg directly from the official repositories.
DICTATION_CLIENT_VERSION = '2.3.0'
#!/usr/bin/python3
from argparse import ArgumentParser
from typing import Tuple, List
from sziszapangma.integration.service_core.asr.asr_result import WordTimeAlignment
from VERSION import DICTATION_CLIENT_VERSION
from service.dictation_settings import DictationSettings
from service.streaming_recognizer import StreamingRecognizer
from utils.audio_source import AudioStream
from utils.mic_source import MicrophoneStream
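# Helpers: convert a protobuf Duration (seconds + nanos) to integer milliseconds
# and wrap a (start_time, end_time) pair into a WordTimeAlignment.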
def duration_to_normalised_millis(duration):
return int((duration.seconds * 1000000000 + duration.nanos) / 1000000)
def word_duration_to_dict(duration_word) -> WordTimeAlignment:
return WordTimeAlignment(duration_to_normalised_millis(duration_word[0]),
duration_to_normalised_millis(duration_word[1]))
def create_audio_stream(args):
# create wave file stream
if args.wave is not None:
return AudioStream(args.wave)
# create microphone stream
if args.mic:
rate = 16000 # [Hz]
chunk = int(rate / 10) # [100 ms]
return MicrophoneStream(rate, chunk)
# default
raise ValueError("Unknown media source to create")
def recognise(wav_path: str, return_time_offsets: bool) -> Tuple[str, List[WordTimeAlignment]]:
print("Dictation ASR gRPC client " + DICTATION_CLIENT_VERSION)
parser = ArgumentParser()
parser.add_argument("--service-address", dest="address", required=False,
default="127.0.0.1:12321",
help="IP address and port (address:port) of a service the client will connect to.",
type=str)
parser.add_argument("--ssl-dir", dest="ssl_directory", default="",
help="If set to a path with ssl credential files (client.crt, client.key, ca.crt), use ssl authentication. Otherwise use insecure channel (default).",
type=str)
parser.add_argument("--wave-path", dest="wave",
help="Path to wave file with speech to be recognized. Should be mono, 8kHz or 16kHz.",
default=wav_path)
parser.add_argument("--mic", help="Use microphone as an audio source (instead of wave file).",
action='store_true')
parser.add_argument("--session-id",
help="Session ID to be passed to the service. If not specified, the service will generate a default session ID itself.",
default="", type=str)
parser.add_argument("--grpc-timeout",
help="Timeout in milliseconds used to set gRPC deadline - how long the client is willing to wait for a reply from the server. If not specified, the service will set the deadline to a very large number.",
default=0, type=int)
# request configuration section
parser.add_argument("--max-alternatives",
help="Maximum number of recognition hypotheses to be returned.",
default=1, type=int)
parser.add_argument("--time-offsets",
help="If set - the recognizer will return also word time offsets.",
action="store_true", default=return_time_offsets)
parser.add_argument("--single-utterance",
help="If set - the recognizer will detect a single spoken utterance.",
action="store_true", default=False)
parser.add_argument("--interim-results",
help="If set - messages with temporal results will be shown.",
action="store_true", default=False)
# timeouts
parser.add_argument("--no-input-timeout", help="MRCP v2 no input timeout [ms].", default=5000,
type=int)
parser.add_argument("--speech-complete-timeout", help="MRCP v2 speech complete timeout [ms].",
default=2000,
type=int)
parser.add_argument("--speech-incomplete-timeout",
help="MRCP v2 speech incomplete timeout [ms].", default=4000,
type=int)
parser.add_argument("--recognition-timeout", help="MRCP v2 recognition timeout [ms].",
default=10000, type=int)
parser.add_argument("--context-phrase", help="Specifies which context model to use.",
default="", type=str)
# Stream audio to the ASR engine and print all hypotheses to standard output
args = parser.parse_args()
print('args')
print(args)
# if args.wave is not None or args.mic:
with create_audio_stream(args) as stream:
settings = DictationSettings(args)
recognizer = StreamingRecognizer(args.address, args.ssl_directory, settings)
print('Recognizing...')
results = recognizer.recognize(stream)
print(results)
return results[0]['transcript'], [
word_duration_to_dict(it) for it in results[0]['alignment']]
#!/bin/bash
# coding=utf-8
# This script sends a request to the dictation service using the dictation client inside a docker container
# Requires the "dictation-client-python:2.3.0" docker image to be loaded locally
set -euo pipefail
IFS=$'\n\t'
SCRIPT=$(realpath "$0")
SCRIPTPATH=$(dirname "${SCRIPT}")
docker_image="dictation-client-python:2.3.0"
usage() {
echo "
Dictation ASR gRPC client 2.3.0
-h, --help show this help message and exit
-s=ADDRESS, --service-address=ADDRESS
IP address and port (address:port) of a service the client will connect to.
-f=WAVE, --filename=WAVE
Name of the wave file with speech to be recognized. File should be inside 'wav' directory. Should be mono, 8kHz or 16kHz.
-m, --mic Use microphone as an audio source (instead of wave file).
--tls If set, uses tls authentication, otherwise use insecure channel (default). The tls credential files (client.crt, client.key, ca.crt) should be placed inside 'tls' directory.
--session-id=SESSION_ID
Session ID to be passed to the service. If not specified, the service will generate a default session ID itself.
--grpc-timeout=GRPC_TIMEOUT
Timeout in milliseconds used to set gRPC deadline - how long the client is willing to wait for a reply from the
server. If not specified, the service will set the deadline to a very large number.
--max-alternatives=MAX_ALTERNATIVES
Maximum number of recognition hypotheses to be returned.
--time-offsets If set - the recognizer will return also word time offsets.
--single-utterance If set - the recognizer will detect a single spoken utterance.
--interim-results If set - messages with temporal results will be shown.
--no-input-timeout=NO_INPUT_TIMEOUT
MRCP v2 no input timeout [ms].
--speech-complete-timeout=SPEECH_COMPLETE_TIMEOUT
MRCP v2 speech complete timeout [ms].
--speech-incomplete-timeout=SPEECH_INCOMPLETE_TIMEOUT
MRCP v2 speech incomplete timeout [ms].
--recognition-timeout=RECOGNITION_TIMEOUT
MRCP v2 recognition timeout [ms].
--context-phrase=CONTEXT_PHRASE
Specifies which context model to use.
"
}
optspec=":fhms-:"
while getopts "f:hms:-:" optchar; do
case "${optchar}" in
-)
case "${OPTARG}" in
help)
usage; exit 0
;;
tls)
opts+=( "--ssl-dir" "/volumen/tls" )
;;
time-offsets)
opts+=( "--time-offsets" )
;;
single-utterance)
opts+=( "--single-utterance" )
;;
interim-results)
opts+=( "--interim-results" )
;;
mic)
opts+=("--mic")
;;
filename=*)
val=${OPTARG#*=}
opt=${OPTARG%=$val}
opts+=( "--wave-path" "/volumen/wav/${val##*/}" )
;;
*=*)
val=${OPTARG#*=}
opt=${OPTARG%=$val}
opts+=( "--$opt" "$val" )
;;
*)
if [ "$OPTERR" = 1 ] && [ "${optspec:0:1}" != ":" ]; then
echo "Unknown option --${OPTARG}" >&2
fi
;;
esac;;
f)
val=${OPTARG#*=}
opt=${OPTARG%=$val}
opts+=( "--wave-path" "/volumen/wav/${val##*/}" )
;;
h)
usage; exit 0
;;
m)
opts+=("--mic")
;;
s)
val=${OPTARG#*=}
opt=${OPTARG%=$val}
opts+=( "--service-address" "${val}" )
;;
*)
if [ "$OPTERR" != 1 ] || [ "${optspec:0:1}" = ":" ]; then
echo "Non-option argument: '-${OPTARG}'" >&2
fi
;;
esac
done
docker run --rm -it -v "${SCRIPTPATH}:/volumen" --network host "${docker_image}" \
python3 /dictation_client/dictation_client.py "${opts[@]}"
import os
import socket
from sziszapangma.integration.service_core.asr.asr_base_processor import AsrBaseProcessor
from sziszapangma.integration.service_core.asr.asr_result import AsrResult
from dictation_client import recognise
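# Processor that transcribes audio files with the remote Techmo dictation service.
# If the local SSH tunnel (port 12321) is not open yet, it is started via ./start_tunneling.sh.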
class TechmoAsrProcessor(AsrBaseProcessor):
@staticmethod
def is_tunnel_running() -> bool:
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
return s.connect_ex(('localhost', 12321)) == 0
def process_asr(self, audio_file_path: str) -> AsrResult:
print(f'processing start {audio_file_path}')
if not self.is_tunnel_running():
os.system('./start_tunneling.sh')
raw_transcription, words_time_alignment = recognise(audio_file_path, True)
transcription = raw_transcription.replace('\n', ' ')
words = [
it
for it in transcription.split(' ')
if it not in ['', ' ']
]
asr_result = AsrResult(words, transcription, words_time_alignment)
print(f'processing end {audio_file_path}, {asr_result}')
return asr_result
if __name__ == '__main__':
TechmoAsrProcessor().start_processor()
#!/bin/bash
# coding=utf-8
set -eo pipefail
virtualenv -p python3 proto_env
# shellcheck disable=SC1091
source proto_env/bin/activate
pip install grpcio-tools==1.7.0
function cleanup() {
# shellcheck disable=SC1091
rm -rf proto_env
}
trap cleanup EXIT
echo "Generating dictation Python protobuf/grpc sources."
path_i="../proto"
path_o="service"
python3 -m grpc_tools.protoc \
-I${path_i} \
-I../submodules/googleapis \
--python_out=${path_o} \
--grpc_python_out=${path_o} \
${path_i}/dictation_asr.proto
# Fix buggy autogenerated GRPC import
sed -i 's/.*import dictation_asr_pb2 as dictation__asr__pb2.*/from . import dictation_asr_pb2 as dictation__asr__pb2/' ${path_o}/dictation_asr_pb2_grpc.py
import os
_TECHMO_SSH_SERVER_KEY = 'TECHMO_SSH_SERVER_KEY'
_TECHMO_KEY_FILE = './techmo_key_file'
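# Write the SSH private key from the environment variable to a local file, restoring escaped newlines.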
if __name__ == '__main__':
key_content = os.environ[_TECHMO_SSH_SERVER_KEY].replace('\\n', '\n')
with open(_TECHMO_KEY_FILE, 'w') as f:
f.write(key_content)
setuptools==50.3.2
grpcio==1.37.0
grpcio-tools==1.37.0
protobuf==3.12.2
gapic-google-cloud-speech-v1==0.15.3
grpc-google-cloud-speech-v1==0.8.1
proto-google-cloud-speech-v1==0.15.3
google-auth==1.21.1
google-auth-httplib2==0.0.3
google-cloud-core==1.0.2
google-cloud-speech==1.0.0
googleapis-common-protos==1.6.0
httplib2==0.14.0
oauth2client==2.0.0
pydub==0.23.1
pyaudio==0.2.11
asr-benchmarks==0.0.1-alpha.48
#!/bin/bash
# coding=utf-8
# This script sends a request to the dictation service using the Python dictation client
# Before using this script, run 'setup.sh' to check dependencies and prepare the virtual environment
set -euo pipefail
IFS=$'\n\t'
SCRIPT=$(realpath "$0")
SCRIPTPATH=$(dirname "${SCRIPT}")
source "${SCRIPTPATH}/.env/bin/activate"
export PYTHONIOENCODING=utf8
python3 "${SCRIPTPATH}/dictation_client.py" "$@"
#!/bin/bash
#python -u prepare_key.py
#
#chmod 600 ./techmo_key_file
#./start_tunneling.sh
python -u main.py
# Generated by the gRPC Python protocol compiler plugin. DO NOT EDIT!
import grpc
from . import dictation_asr_pb2 as dictation__asr__pb2
class SpeechStub(object):
"""Service that implements Google Cloud Speech API extended by Techmo.
"""
def __init__(self, channel):
"""Constructor.
Args:
channel: A grpc.Channel.
"""
self.Recognize = channel.unary_unary(
'/google.cloud.speech.v1.Speech/Recognize',
request_serializer=dictation__asr__pb2.RecognizeRequest.SerializeToString,
response_deserializer=dictation__asr__pb2.RecognizeResponse.FromString,
)
self.StreamingRecognize = channel.stream_stream(
'/google.cloud.speech.v1.Speech/StreamingRecognize',
request_serializer=dictation__asr__pb2.StreamingRecognizeRequest.SerializeToString,
response_deserializer=dictation__asr__pb2.StreamingRecognizeResponse.FromString,
)
class SpeechServicer(object):
"""Service that implements Google Cloud Speech API extended by Techmo.
"""
def Recognize(self, request, context):
"""Performs synchronous speech recognition: receive results after all audio
has been sent and processed.
"""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def StreamingRecognize(self, request_iterator, context):
"""Performs asynchronous speech recognition: receive results via the
google.longrunning.Operations interface. Returns either an
`Operation.error` or an `Operation.response` which contains
a `LongRunningRecognizeResponse` message.
rpc LongRunningRecognize(LongRunningRecognizeRequest) returns (google.longrunning.Operation) {
option (google.api.http) = {
post: "/v1/speech:longrunningrecognize"
body: "*"
};
}
Performs bidirectional streaming speech recognition: receive results while
sending audio. This method is only available via the gRPC API (not REST).
"""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def add_SpeechServicer_to_server(servicer, server):
rpc_method_handlers = {
'Recognize': grpc.unary_unary_rpc_method_handler(
servicer.Recognize,
request_deserializer=dictation__asr__pb2.RecognizeRequest.FromString,
response_serializer=dictation__asr__pb2.RecognizeResponse.SerializeToString,
),
'StreamingRecognize': grpc.stream_stream_rpc_method_handler(
servicer.StreamingRecognize,
request_deserializer=dictation__asr__pb2.StreamingRecognizeRequest.FromString,
response_serializer=dictation__asr__pb2.StreamingRecognizeResponse.SerializeToString,
),
}
generic_handler = grpc.method_handlers_generic_handler(
'google.cloud.speech.v1.Speech', rpc_method_handlers)
server.add_generic_rpc_handlers((generic_handler,))
class DictationSettings:
"""Default settings for Techmo Dictation ASR (timeouts and thresholds)"""
def __init__(self, args):
# use configuration directly
self.args = args
def session_id(self):
return self.args.session_id
def grpc_timeout(self):
return self.args.grpc_timeout
def max_alternatives(self):
return self.args.max_alternatives
def time_offsets(self):
return self.args.time_offsets
def single_utterance(self):
return self.args.single_utterance
def interim_results(self):
return self.args.interim_results
def timeouts_map(self):
return {
"no-input-timeout": str(self.args.no_input_timeout),
"speech-complete-timeout": str(self.args.speech_complete_timeout),
"speech-incomplete-timeout": str(self.args.speech_incomplete_timeout),
"recognition-timeout": str(self.args.recognition_timeout),
}
def context_phrase(self):
return self.args.context_phrase
import os
import threading
from . import dictation_asr_pb2 as dictation_asr_pb2
from . import dictation_asr_pb2_grpc as dictation_asr_pb2_grpc
import grpc
class RequestIterator:
"""Thread-safe request iterator for streaming recognizer."""
def __init__(self, audio_stream, settings):
# Iterator data
self.audio_stream = audio_stream
self.audio_generator = self.audio_stream.generator()
self.settings = settings
self.request_builder = {
True: self._initial_request,
False: self._normal_request
}
# Iterator state
self.lock = threading.Lock()
self.is_initial_request = True
self.eos = False # indicates whether the end-of-stream message was sent (request to stop the iterator)
def _initial_request(self):
req = StreamingRecognizer.build_configuration_request(self.audio_stream.frame_rate(), self.settings)
self.is_initial_request = False
return req
def _normal_request(self):
data = next(self.audio_generator)
if data is None:
raise StopIteration
return dictation_asr_pb2.StreamingRecognizeRequest(audio_content=data)
def __iter__(self):
return self
def __next__(self):
with self.lock:
return self.request_builder[self.is_initial_request]()
class StreamingRecognizer:
def __init__(self, address, ssl_directory, settings_args):
# Use ArgumentParser to parse settings
self.service = dictation_asr_pb2_grpc.SpeechStub(StreamingRecognizer.create_channel(address, ssl_directory))
self.settings = settings_args
def recognize(self, audio):
requests_iterator = RequestIterator(audio, self.settings)
return self.recognize_audio_content(requests_iterator)
def recognize_audio_content(self, requests_iterator):
time_offsets = self.settings.time_offsets()
timeout = None
if self.settings.grpc_timeout() > 0:
timeout = self.settings.grpc_timeout() / 1000 # milliseconds to seconds
metadata = []
if self.settings.session_id():
metadata = [('session_id', self.settings.session_id())]
recognitions = self.service.StreamingRecognize(requests_iterator, timeout=timeout, metadata=metadata)
confirmed_results = []
alignment = []
confidence = 1.0
for recognition in recognitions:
if recognition.error.code:
print(u"Received error response: ({}) {}".format(recognition.error.code, recognition.error.message))
requests_iterator.audio_stream.close()
elif recognition.speech_event_type != dictation_asr_pb2.StreamingRecognizeResponse.SPEECH_EVENT_UNSPECIFIED:
print(u"Received speech event type: {}".format(
dictation_asr_pb2.StreamingRecognizeResponse.SpeechEventType.Name(recognition.speech_event_type)))
requests_iterator.audio_stream.close()
# process response type
elif recognition.results is not None and len(recognition.results) > 0:
first = recognition.results[0]
if first.is_final:
if time_offsets:
for word in first.alternatives[0].words:
if word.word != '<eps>':
confirmed_results.append(word.word)
alignment.append([word.start_time, word.end_time])
else:
confirmed_results.append(first.alternatives[0].transcript)
confidence = min(confidence, first.alternatives[0].confidence)
else:
print(u"Temporal results - {}".format(first))
# build final results
final_alignment = [[]]
final_transc = ' '.join(confirmed_results)
if time_offsets and alignment:
final_alignment = alignment
return [{
'transcript': final_transc,
'alignment': final_alignment,
'confidence': confidence
}] # array with one element
@staticmethod
def create_channel(address, ssl_directory):
if not ssl_directory:
return grpc.insecure_channel(address)
def read_file(path):
with open(path, 'rb') as file:
return file.read()
return grpc.secure_channel(address, grpc.ssl_channel_credentials(
read_file(os.path.join(ssl_directory, 'ca.crt')),
read_file(os.path.join(ssl_directory, 'client.key')),
read_file(os.path.join(ssl_directory, 'client.crt')),
))
@staticmethod
def build_recognition_config(sampling_rate, settings):
recognition_config = dictation_asr_pb2.RecognitionConfig(
encoding='LINEAR16', # one of LINEAR16, FLAC, MULAW, AMR, AMR_WB
sample_rate_hertz=sampling_rate, # the rate in hertz
# See https://g.co/cloud/speech/docs/languages for a list of supported languages.
language_code='pl-PL', # a BCP-47 language tag
enable_word_time_offsets=settings.time_offsets(), # if true, return recognized word time offsets
max_alternatives=1, # maximum number of returned hypotheses
)
if settings.context_phrase():
speech_context = recognition_config.speech_contexts.add()
speech_context.phrases.append(settings.context_phrase())
return recognition_config
@staticmethod
def build_configuration_request(sampling_rate, settings):
config_req = dictation_asr_pb2.StreamingRecognizeRequest(
streaming_config=dictation_asr_pb2.StreamingRecognitionConfig(
config=StreamingRecognizer.build_recognition_config(sampling_rate, settings),
single_utterance=settings.single_utterance(),
interim_results=settings.interim_results()
)
# no audio data in first request (config only)
)
# timeout settings
timeouts = settings.timeouts_map()
for settings_key in timeouts:
cf = config_req.streaming_config.config.config_fields.add()
cf.key = settings_key
cf.value = "{}".format(timeouts[settings_key])
return config_req