How to: Use AudioShake to improve accuracy in localization, transcription, and captioning workflows on AWS

Jessica Powell
June 21, 2024

A significant market opportunity exists for individual creators and organizations to provide transcription, dubbing, and captioning services to reach global audiences rapidly. By quantifying demand and optimizing these services, providers can deliver high-quality viewing experiences for consumers worldwide.

Automated speech recognition (ASR), artificial intelligence (AI), and machine learning (ML) enable greater access to captioning and localization services for diverse content types. But background noise and music in videos can diminish transcription accuracy. And in the absence of multi-track audio, localized outputs may lack the original music and sound effects.

Intelligent automation offers possibilities to expand inclusion and access to content, but accuracy issues persist, particularly with noisy audio. Targeted improvements to speech recognition and multi-track audio support can facilitate more nuanced localization that retains artistic intent. The market landscape suggests the potential to scale captioning and localization solutions to unlock business value.

AudioShake provides an innovative solution to improve audio processing for ASR, captioning, dubbing services, and more. Leveraging patented sound separation technology, AudioShake can isolate dialogue and music and effects within a single audio track from a film, TV, or radio recording, providing cleaner input audio for downstream services. Consequently, transcription accuracy increases for ASR and captioning. For dubbing services, AudioShake offers an additional benefit: by separating music and effects, localized audio retains the same high production value as the original content. This provides a more authentic viewing experience during localization.

For film production studios like Germany’s Pandastorm, AudioShake’s technology makes the once impossible possible, opening doors to localization for archival productions that lack multi-track audio. Pandastorm used AudioShake to separate the dialogue from the background music and effects in the iconic British show Doctor Who, so the series could be dubbed into German with human voice actors.

Likewise, dubbing companies using fully automated or human-AI hybrid solutions can apply AudioShake’s technology at scale. Cielo24, an AudioShake dubbing customer, achieved transcription accuracy improvements of more than 25% with AudioShake, allowing the company to transcribe audio and video content faster and more efficiently. This saves valuable time and resources. Watch how Cielo24 implemented AudioShake in its ASR and synthetic voice pipeline here.

This blog post provides guidance for building an event-driven automated workflow using Amazon Web Services (AWS) services like AWS Lambda, Amazon EventBridge, Amazon API Gateway, AWS Step Functions, Amazon SQS, and Amazon S3, with AudioShake Groovy to separate dialogue, and music and effects from media assets.

Workflow walkthrough

A diagram showing the use of AWS serverless architecture for separating dialogue and music and effects from media assets using AudioShake Groovy.

The workflow uses an event-driven approach to initiate the process of separating dialogue and music and effects from media assets.
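Steps 1 and 2 in the walkthrough below are connected by an EventBridge rule that routes the upstream event to the Lambda function calling the AudioShake API. As a minimal sketch, the rule could be wired up with boto3 as follows; the event bus name, event source, detail type, and function ARN are illustrative assumptions, not values defined by this workflow.

```python
import json

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Hypothetical names: replace with the event bus and function used in your environment.
EVENT_BUS_NAME = "media-workflow-bus"
START_FUNCTION_ARN = "arn:aws:lambda:us-east-1:111122223333:function:start-audioshake-job"

# Rule that matches the upstream "asset ready" event (step 1) and routes it
# to the Lambda function that calls the AudioShake API (step 2).
rule = events.put_rule(
    Name="start-audioshake-separation",
    EventBusName=EVENT_BUS_NAME,
    EventPattern=json.dumps(
        {
            "source": ["com.example.media-ingest"],  # assumed upstream event source
            "detail-type": ["MediaAssetReady"],      # assumed detail type
        }
    ),
    State="ENABLED",
)

events.put_targets(
    Rule="start-audioshake-separation",
    EventBusName=EVENT_BUS_NAME,
    Targets=[{"Id": "start-lambda", "Arn": START_FUNCTION_ARN}],
)

# Allow EventBridge to invoke the target Lambda function.
lambda_client.add_permission(
    FunctionName=START_FUNCTION_ARN,
    StatementId="allow-eventbridge-start-rule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
```

The same wiring can of course be expressed in CloudFormation, SAM, or the AWS CDK; the script above simply makes the rule-to-function relationship explicit.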

  1. An upstream service sends an Amazon EventBridge event to initiate the workflow.
  2. Based on the event, an Amazon EventBridge rule invokes an AWS Lambda function that calls the AudioShake API to register a video or audio asset (see the Lambda function sketch after this list). The Lambda function generates an Amazon S3 pre-signed URL that AudioShake uses to retrieve the asset for processing.
  3. The AudioShake API uses the pre-signed URL to download the asset. AudioShake returns an asset ID used in step 4 to create a job to process the asset. An event is sent to the EventBridge event bus to trigger step 4.
  4. Using the asset ID from step 3, a Lambda function calls the AudioShake API to create a stem job that separates dialogue from music and effects. The API call specifies a callback URL, which AudioShake calls when the job completes.
  5. The AudioShake API receives the API call from step 4 and creates a job. Using AWS Step Functions, the job is queued for processing by putting a message in an Amazon SQS queue.
  6. AudioShake polls the message from the SQS queue and begins processing the asset, separating the dialogue from the music and effects.
  7. Once processing completes, AudioShake calls the callback URL exposed through Amazon API Gateway, including pre-signed URLs for downloading the separated dialogue and music and effects files. API Gateway invokes a Lambda function that processes the callback and puts an event on the EventBridge event bus (see the callback sketch after this list).
  8. An EventBridge rule triggers another Lambda function, which uses the pre-signed URLs provided by AudioShake to download the files to Amazon S3.
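For steps 2 through 4, a single Lambda handler could generate the pre-signed URL, register the asset, and create the stem job. The sketch below assumes hypothetical AudioShake endpoint paths (/assets and /jobs), request fields, response shapes, and incoming event detail fields; refer to the AudioShake API documentation for the actual interface.

```python
import json
import os
import urllib.request

import boto3

s3 = boto3.client("s3")

# Hypothetical endpoint and credentials; consult AudioShake's documentation for the real API.
AUDIOSHAKE_API = os.environ.get("AUDIOSHAKE_API", "https://api.example-audioshake.com")
AUDIOSHAKE_TOKEN = os.environ["AUDIOSHAKE_TOKEN"]
CALLBACK_URL = os.environ["CALLBACK_URL"]  # the API Gateway endpoint used in step 7


def _post(path, payload):
    """POST a JSON payload to the (assumed) AudioShake API and return the parsed response."""
    request = urllib.request.Request(
        f"{AUDIOSHAKE_API}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {AUDIOSHAKE_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def handler(event, context):
    """Invoked by the EventBridge rule in step 2."""
    detail = event["detail"]

    # Generate an S3 pre-signed URL that AudioShake can use to download the source asset (step 2).
    presigned_url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": detail["bucket"], "Key": detail["key"]},
        ExpiresIn=3600,
    )

    # Register the asset; the response is assumed to contain an asset ID (step 3).
    asset = _post("/assets", {"url": presigned_url})

    # Create a stem job that separates dialogue from music and effects, passing the
    # callback URL AudioShake calls when the job completes (step 4).
    job = _post(
        "/jobs",
        {
            "assetId": asset["id"],
            "stems": ["dialogue", "music_and_effects"],
            "callbackUrl": CALLBACK_URL,
        },
    )

    return {"assetId": asset["id"], "jobId": job["id"]}
```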
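For steps 7 and 8, one Lambda function behind API Gateway can forward the AudioShake callback to the event bus, and a second function can copy the separated stems into Amazon S3. The callback payload shape assumed below (a list of stems with names and pre-signed URLs) is for illustration only.

```python
import json
import os
import urllib.request

import boto3

events = boto3.client("events")
s3 = boto3.client("s3")

OUTPUT_BUCKET = os.environ["OUTPUT_BUCKET"]
EVENT_BUS_NAME = os.environ.get("EVENT_BUS_NAME", "media-workflow-bus")  # hypothetical bus name


def callback_handler(event, context):
    """Invoked through API Gateway when AudioShake calls the callback URL (step 7)."""
    body = json.loads(event["body"])  # assumed to contain per-stem names and pre-signed URLs

    events.put_events(
        Entries=[
            {
                "EventBusName": EVENT_BUS_NAME,
                "Source": "com.example.audioshake-callback",  # assumed event source
                "DetailType": "StemJobCompleted",
                "Detail": json.dumps(body),
            }
        ]
    )
    return {"statusCode": 200, "body": json.dumps({"received": True})}


def download_handler(event, context):
    """Invoked by an EventBridge rule on StemJobCompleted; copies the stems to S3 (step 8)."""
    for stem in event["detail"]["stems"]:  # e.g. the dialogue and the music and effects stems
        with urllib.request.urlopen(stem["url"]) as response:
            s3.upload_fileobj(response, OUTPUT_BUCKET, f"stems/{stem['name']}.wav")
```

Keeping the callback handler thin and publishing the payload to EventBridge lets any number of downstream consumers react to job completion without coupling them to the API Gateway endpoint.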

Completion of the AudioShake job can trigger an Amazon EventBridge event that publishes a notification for the downloaded asset. The following workflow demonstrates how an event-based architecture can parse this job completion event from AudioShake and invoke downstream services such as Amazon Transcribe for caption generation, Amazon Translate for subtitle generation, and Amazon Polly for synthesized speech in the desired languages.

A diagram showing the use of AWS services on dialogue audio to generate captions, subtitles, and synthesized speech in multiple desired languages.

  1. Uploading the separated dialogue audio file to Amazon S3 triggers an Amazon EventBridge event to invoke an AWS Lambda function.
  2. The AWS Lambda function initiates an Amazon Transcribe job to generate a caption file from the dialogue audio (see the Amazon Transcribe sketch after this list). Using only the dialogue audio provides cleaner input, which increases the accuracy of caption generation.
  3. Completion of the caption file triggers an Amazon EventBridge event to invoke another Lambda function.
  4. This Lambda function initiates translation jobs on the generated caption file using Amazon Translate to produce subtitle files in the desired output languages (see the Amazon Translate sketch after this list).
  5. Completion of each subtitle file invokes a Lambda function via an EventBridge rule.
  6. The Lambda function launches a text-to-speech job in Amazon Polly to generate the translated audio (see the Amazon Polly sketch after this list).
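For step 2, the Lambda function can start the transcription job with subtitle output enabled so that Amazon Transcribe writes a caption file directly. A minimal sketch follows; the bucket names, media format, source language, and job naming scheme are assumptions.

```python
import os

import boto3

transcribe = boto3.client("transcribe")

CAPTION_BUCKET = os.environ["CAPTION_BUCKET"]  # destination bucket for transcripts and captions


def handler(event, context):
    """Triggered when the separated dialogue stem lands in Amazon S3 (steps 1-2)."""
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]

    transcribe.start_transcription_job(
        TranscriptionJobName=key.replace("/", "-"),
        Media={"MediaFileUri": f"s3://{bucket}/{key}"},
        MediaFormat="wav",                      # assumes the dialogue stem is delivered as WAV
        LanguageCode="en-US",                   # source language of the dialogue
        Subtitles={"Formats": ["srt", "vtt"]},  # have Transcribe emit caption files directly
        OutputBucketName=CAPTION_BUCKET,
    )
```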
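For step 4, a minimal approach is to translate the caption text with the synchronous Amazon Translate API; batch translation jobs are the alternative for larger volumes. The target languages and the S3 event shape assumed below are illustrative.

```python
import os

import boto3

s3 = boto3.client("s3")
translate = boto3.client("translate")

SUBTITLE_BUCKET = os.environ["SUBTITLE_BUCKET"]
TARGET_LANGUAGES = ["de", "fr", "es"]  # illustrative target languages


def handler(event, context):
    """Triggered when the Amazon Transcribe caption file is ready (step 3)."""
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]

    # Read the generated caption file. Translating caption text in one call is a
    # simplification; production pipelines translate cue text while preserving timestamps.
    captions = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    for language in TARGET_LANGUAGES:
        result = translate.translate_text(
            Text=captions,
            SourceLanguageCode="en",
            TargetLanguageCode=language,
        )
        s3.put_object(
            Bucket=SUBTITLE_BUCKET,
            Key=f"subtitles/{language}/{key}",
            Body=result["TranslatedText"].encode("utf-8"),
        )
```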
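For step 6, the Lambda function can submit the translated text to Amazon Polly as an asynchronous synthesis task that writes the audio directly to Amazon S3. The voice mapping below is illustrative, and passing subtitle text straight to Polly is a simplification that ignores cue timing.

```python
import os

import boto3

s3 = boto3.client("s3")
polly = boto3.client("polly")

AUDIO_BUCKET = os.environ["AUDIO_BUCKET"]

# Illustrative mapping of subtitle language codes to Amazon Polly neural voices.
VOICES = {"de": "Vicki", "fr": "Lea", "es": "Lucia"}


def handler(event, context):
    """Triggered when a translated subtitle file is ready (step 5)."""
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]           # e.g. subtitles/de/<asset>.srt
    language = key.split("/")[1]

    subtitle_text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    # Asynchronous synthesis writes the generated audio straight to Amazon S3.
    polly.start_speech_synthesis_task(
        Text=subtitle_text,
        OutputFormat="mp3",
        VoiceId=VOICES.get(language, "Joanna"),
        Engine="neural",
        OutputS3BucketName=AUDIO_BUCKET,
        OutputS3KeyPrefix=f"audio/{language}/",
    )
```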

The integration between AudioShake, Amazon Transcribe, Amazon Translate, and Amazon Polly enables customers to leverage multiple AWS services to efficiently process audio assets for global distribution. By connecting these services, customers can automate transcription, translation, and text-to-speech for audio content. This streamlined workflow allows customers to reach broader audiences while reducing the time and effort required to prepare multi-language audio assets.

Conclusion

The AudioShake service on AWS uses machine learning to improve audio quality for speech recognition workflows. By removing background noise and isolating speech audio, AudioShake increases the accuracy of downstream services, including localization, transcription, and captioning. This audio enhancement allows users without audio engineering expertise to achieve higher quality results.

By leveraging AWS infrastructure, AudioShake runs state-of-the-art machine learning models on high-performance GPU servers for fast inference. In summary, AudioShake on AWS delivers an efficient solution to boost speech recognition accuracy through automated audio refinement. The service eliminates the need for manual audio editing while achieving enhanced outcomes for vital workflows such as transcription and captioning.