Scaling Automated Speech Recognition for an Enterprise Use Case

Client Background

The client is a technology and digital consulting company with 25 years of experience in Europe and the US. Its customer base includes nearly 80% of IBEX 35 companies. Over the years, the company has accumulated a large volume of audio data in files of widely varying sizes. To transcribe these files efficiently, the client needed a scalable solution that could handle both real-time and batch transcription.

Problem Statement

The client's existing process could not cope with the volume of audio data it had accumulated. The data was unstructured and difficult to analyze, which severely limited its usefulness. Without a dependable transcription solution, the company struggled to extract insights from the data, and this impeded informed decision-making.


Solution

We proposed an AI-powered transcription service capable of handling both real-time and batch transcription. Our team implemented a scalable, cloud-based infrastructure built on NVIDIA's Triton Inference Server to manage large volumes of data. The solution relied on Kaldi, an open-source speech recognition toolkit, to convert audio files into text.
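To illustrate how a single service can cover both modes, here is a minimal sketch of client-side audio chunking. It is not taken from the project's code: the `AudioChunk` type, the `iter_chunks` helper, and the chunk sizes are hypothetical names and values chosen for illustration. The idea is that real-time transcription sends an utterance as a sequence of small chunks with start and end markers, while batch transcription can send the whole file as one chunk.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class AudioChunk:
    """One slice of an utterance (hypothetical type, for illustration only)."""
    samples: List[float]  # PCM samples in this chunk
    is_first: bool        # marks the start of the utterance
    is_last: bool         # marks the end of the utterance


def iter_chunks(samples: List[float], chunk_size: int) -> Iterator[AudioChunk]:
    """Split one utterance into fixed-size chunks, flagging the first and last."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = len(samples)
    # An empty utterance still yields one empty chunk so the sequence is closed.
    starts = range(0, total, chunk_size) if total else [0]
    for start in starts:
        end = min(start + chunk_size, total)
        yield AudioChunk(samples[start:end], start == 0, end >= total)


audio = [0.0] * 10

# Real-time style: several small chunks, sent as they become available.
streaming = list(iter_chunks(audio, chunk_size=4))

# Batch style: the whole file in a single chunk.
batch = list(iter_chunks(audio, chunk_size=len(audio)))
```

In a setup like the one described, the first and last flags would typically be mapped onto whatever sequence start and end signals the inference server expects; that mapping is an assumption here and is not shown.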

Results & Highlights

  • To achieve scalability, we developed Python bindings for the Kaldi ASR client and added multi-GPU support to Triton's Kaldi backend. The solution can therefore scale to any number of GPUs on the host machine and run inference on them in parallel.
  • The client communicates with the Triton server over gRPC. The workload is distributed evenly, and exceptions raised on the server are propagated to the client, which makes failures easier to monitor.
  • To our knowledge, these bindings are currently the only available solution for scalable inference with Kaldi ASR models on Triton.
  • The outcome is a Python package that exposes a flexible, high-level API.
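The two ideas in the list above, even distribution of work across GPUs and propagation of failures back to the caller, can be sketched with the Python standard library alone. This is an illustrative sketch, not the project's implementation: in the real system the scheduling happens inside Triton and its Kaldi backend, whereas here a thread pool and round-robin assignment stand in for it. `transcribe_all`, `TranscriptionError`, and the `transcribe_on_gpu` callable are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


class TranscriptionError(RuntimeError):
    """Raised on the caller's side when a worker fails (hypothetical name)."""


def transcribe_all(
    files: Sequence[str],
    transcribe_on_gpu: Callable[[str, int], str],
    num_gpus: int,
) -> List[str]:
    """Transcribe files in parallel, assigning them to GPUs round-robin."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")

    def run(indexed: Tuple[int, str]) -> str:
        index, path = indexed
        gpu_id = index % num_gpus  # round-robin keeps the load even
        try:
            return transcribe_on_gpu(path, gpu_id)
        except Exception as exc:
            # Wrap the failure so the caller can see which file and GPU failed.
            raise TranscriptionError(
                f"{path} failed on GPU {gpu_id}: {exc}"
            ) from exc

    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        # map() preserves input order and re-raises worker exceptions
        # when the corresponding result is retrieved.
        return list(pool.map(run, enumerate(files)))


def fake_transcribe(path: str, gpu_id: int) -> str:
    """Stand-in for a real inference call (assumption, for the example only)."""
    if path == "bad.wav":
        raise ValueError("unreadable audio")
    return f"{path}@gpu{gpu_id}"


results = transcribe_all(["a.wav", "b.wav", "c.wav"], fake_transcribe, num_gpus=2)
```

Wrapping the original exception with `raise ... from exc` keeps the underlying cause attached, so the caller sees both which file failed and why.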


The AI-powered transcription solution converted the client's extensive audio data into structured, analyzable text, allowing the client to extract insights from the data and improve its decision-making. Because the solution scales, the client can handle growing data volumes without compromising on speed or accuracy. This case shows how AI technology can overcome difficult data problems and open up new opportunities for extracting insight.