29. Jul 2025 • 3 minutes read
Integrating TTS and STT in Java for WCAG accessibility
Danijel Dragičević
Software Engineer
What WCAG is and why we should care
As digital technology continues to shape how we live and work, making digital content accessible to everyone is no longer optional – it’s essential. The Web Content Accessibility Guidelines (WCAG), developed by the W3C, provide a clear framework for making web content accessible to people with disabilities.
Structured around four core principles – Perceivable, Operable, Understandable, and Robust (POUR) – WCAG helps development teams build more inclusive and legally compliant applications. Among its many recommendations, WCAG emphasizes the importance of offering content in multiple modalities, such as text alternatives for audio and audio alternatives for text.
This is exactly where Text-to-Speech (TTS) and Speech-to-Text (STT) technologies come into play.
Why voice accessibility matters
Voice-based interactions offer tremendous value to users with visual, motor, or cognitive disabilities. TTS reads content aloud, supporting users who struggle with reading or visual processing. On the other hand, STT allows users to speak commands or content, which is especially useful for those with difficulty typing or using touch interfaces.
Integrating speech functionality into our applications goes beyond simply meeting accessibility requirements. It creates meaningful, everyday improvements in user experience. Consider scenarios like:
- Reading assistance for users with dyslexia, vision loss, or cognitive disabilities who rely on TTS to consume web content.
- Hands-free interaction in environments where typing isn’t practical, whether for users with motor impairments or for professionals on the move.
- Voice-powered form filling, note-taking, or messaging for users who prefer or need to speak instead of type.
- Multi-language applications that detect spoken or written input in various languages and respond accordingly.
To bring these capabilities into a real-world application, I developed a Java Spring Boot backend that integrates with several AWS services: Amazon Polly for TTS, Amazon Transcribe for STT, Amazon Comprehend for language detection, and Amazon S3 for audio file storage.
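For orientation, here is a sketch of how the four AWS SDK for Java v2 clients might be registered as Spring beans. This configuration class is my own illustration rather than the project’s actual code; the class name and the hard-coded region are assumptions, and credentials are resolved through the default provider chain:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.comprehend.ComprehendClient;
import software.amazon.awssdk.services.polly.PollyClient;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.transcribe.TranscribeClient;

// Hypothetical wiring: the region is an assumed placeholder.
@Configuration
public class AwsClientConfig {

    private static final Region REGION = Region.EU_CENTRAL_1; // assumed region

    @Bean
    public ComprehendClient comprehendClient() {
        return ComprehendClient.builder().region(REGION).build();
    }

    @Bean
    public PollyClient pollyClient() {
        return PollyClient.builder().region(REGION).build();
    }

    @Bean
    public S3Client s3Client() {
        return S3Client.builder().region(REGION).build();
    }

    @Bean
    public TranscribeClient transcribeClient() {
        return TranscribeClient.builder().region(REGION).build();
    }
}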
Let’s walk through how this backend is designed and how each piece fits together to support WCAG-aligned voice accessibility.
How the backend works
Text-to-Speech (TTS) flow
When a user submits text that should be read aloud, the backend performs the following steps:
- Language Detection – The input text is analyzed using Amazon Comprehend to determine the most dominant language.
- Voice Selection – Based on the detected language, a matching voice is selected using internal logic, e.g., Joanna for English or Vicki for German (both support Polly’s neural engine); a sketch of this lookup appears below.
- Speech Synthesis – The text and selected voice are sent to Amazon Polly, which returns a high-quality MP3 audio stream.
- Response Streaming – This audio is then streamed back to the client for immediate playback (an endpoint sketch follows the Polly snippet below).
Here’s the repository logic used to detect language with Comprehend:
public String detectLanguage(String text) throws ComprehendRepositoryException {
    try {
        DetectDominantLanguageRequest request = DetectDominantLanguageRequest.builder()
                .text(text)
                .build();

        DetectDominantLanguageResponse response = comprehendClient.detectDominantLanguage(request);

        if (!response.languages().isEmpty()) {
            return response.languages().get(0).languageCode();
        } else {
            return "en"; // Default to English if no languages are detected
        }
    } catch (ComprehendException e) {
        log.error("AWS Comprehend error while detecting language", e);
        throw new ComprehendRepositoryException("AWS Comprehend error", e);
    } catch (Exception e) {
        log.error("Unexpected error accessing Comprehend service", e);
        throw new ComprehendRepositoryException("Error accessing Comprehend service", e);
    }
}
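The voice-selection logic from step 2 isn’t part of the repository code, but it can be as simple as a lookup from the detected language code to a Polly voice and locale. The helper below is a minimal sketch of that idea; the class name and the exact mapping are my own assumptions, limited to neural-capable voices:

import java.util.Map;

// Hypothetical helper that maps a Comprehend language code to a
// Polly voice and locale; the mapping itself is an assumption.
public class VoiceSelector {

    private static final Map<String, String[]> VOICES = Map.of(
            "en", new String[] {"Joanna", "en-US"},
            "de", new String[] {"Vicki", "de-DE"},
            "fr", new String[] {"Lea", "fr-FR"},
            "es", new String[] {"Lucia", "es-ES"}
    );

    /** Returns {pollyVoiceId, pollyLocaleCode}, falling back to English. */
    public static String[] selectVoice(String languageCode) {
        return VOICES.getOrDefault(languageCode, VOICES.get("en"));
    }
}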
And here’s how we send the text to Polly for speech synthesis:
public InputStream convertTextToSpeech(String text, String pollyVoiceId, String pollyLocaleCode) throws PollyRepositoryException {
    try {
        SynthesizeSpeechRequest request = SynthesizeSpeechRequest.builder()
                .text(text)
                .voiceId(VoiceId.fromValue(pollyVoiceId))
                .languageCode(LanguageCode.fromValue(pollyLocaleCode))
                .outputFormat(OutputFormat.MP3)
                .engine(Engine.NEURAL)
                .build();

        return pollyClient.synthesizeSpeech(request);
    } catch (PollyException e) {
        log.error("AWS Polly error while converting text to speech", e);
        throw new PollyRepositoryException("AWS Polly error", e);
    } catch (Exception e) {
        log.error("Unexpected error accessing Polly service", e);
        throw new PollyRepositoryException("Error accessing Polly service", e);
    }
}
By leveraging Polly’s neural engine and multilingual voice support, the system produces natural, localized speech output suitable for a wide range of users.
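To close the loop (step 4), the resulting audio stream can be returned straight to the client. The endpoint below is a sketch of what that could look like, not the project’s actual controller; the /api/tts route, the TtsService facade, and its method name are assumptions:

import org.springframework.core.io.InputStreamResource;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

import java.io.InputStream;

// Hypothetical facade combining detectLanguage, voice selection,
// and convertTextToSpeech from the snippets above.
interface TtsService {
    InputStream synthesize(String text);
}

// Hypothetical controller; the route name is an assumption.
@RestController
class TtsController {

    private final TtsService ttsService;

    TtsController(TtsService ttsService) {
        this.ttsService = ttsService;
    }

    @PostMapping(value = "/api/tts", produces = "audio/mpeg")
    public ResponseEntity<InputStreamResource> synthesize(@RequestBody String text) {
        // Detect language, pick a voice, synthesize, then stream the MP3 back
        InputStream audio = ttsService.synthesize(text);
        return ResponseEntity.ok()
                .contentType(MediaType.parseMediaType("audio/mpeg"))
                .body(new InputStreamResource(audio));
    }
}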
Speech-to-Text (STT) flow
The reverse workflow is just as smooth. When a user uploads an audio recording, the backend processes it as follows:
- Upload to S3 – The audio file is stored in a secure Amazon S3 bucket.
- Start Transcription – The S3 URL is passed to Amazon Transcribe, which launches an asynchronous transcription job.
- Check Job Status – Clients can periodically check the job status via a unique identifier. Once the job completes, the transcript becomes available (a status lookup is sketched below).
- Clean Up – A scheduled background task periodically removes completed jobs and associated files from S3 to free up resources (a sketch follows at the end of this section).
Here’s a snippet showing how we upload the audio file:
public String uploadAudioFile(MultipartFile audioFile) throws S3RepositoryException {
    String key = "audio-" + UUID.randomUUID() + ".mp3";

    PutObjectRequest putRequest = PutObjectRequest.builder()
            .bucket(bucketName)
            .key(key)
            .contentType(audioFile.getContentType())
            .build();

    try (InputStream inputStream = audioFile.getInputStream()) {
        s3Client.putObject(putRequest, RequestBody.fromInputStream(inputStream, audioFile.getSize()));
    } catch (S3Exception e) {
        log.error("Error uploading file to S3", e);
        throw new S3RepositoryException("Error uploading file to S3", e);
    } catch (Exception e) {
        log.error("Unexpected error uploading file to S3", e);
        throw new S3RepositoryException("Unexpected error uploading file to S3", e);
    }

    return key;
}
And this is how we start a transcription job with Amazon Transcribe:
public String startTranscriptionJob(String s3Key) throws TranscribeRepositoryException {
    String jobName = "job-" + UUID.randomUUID();

    Media media = Media.builder()
            .mediaFileUri("s3://" + bucketName + "/" + s3Key)
            .build();

    StartTranscriptionJobRequest request = StartTranscriptionJobRequest.builder()
            .transcriptionJobName(jobName)
            .mediaFormat(MediaFormat.MP3)
            .media(media)
            .identifyLanguage(true)
            .languageOptions(LanguageCode.EN_US, LanguageCode.DE_DE, ...)
            .build();

    try {
        transcribeClient.startTranscriptionJob(request);
        return jobName;
    } catch (TranscribeException e) {
        log.error("Failed to start transcription job", e);
        throw new TranscribeRepositoryException("Failed to start transcription job", e);
    }
}
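Polling the job status (step 3) uses the same GetTranscriptionJob call that retrieval relies on. A minimal sketch, with the method name being my own:

public TranscriptionJobStatus getJobStatus(String jobName) throws TranscribeRepositoryException {
    try {
        GetTranscriptionJobResponse response = transcribeClient.getTranscriptionJob(
                GetTranscriptionJobRequest.builder()
                        .transcriptionJobName(jobName)
                        .build()
        );

        // One of QUEUED, IN_PROGRESS, COMPLETED, or FAILED
        return response.transcriptionJob().transcriptionJobStatus();
    } catch (TranscribeException e) {
        log.error("Failed to get status for transcription job {}", jobName, e);
        throw new TranscribeRepositoryException("Failed to get job status", e);
    }
}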
Once the job completes, the transcript can be retrieved like this:
public String fetchTranscript(String jobName) throws TranscribeRepositoryException {
    try {
        GetTranscriptionJobResponse response = transcribeClient.getTranscriptionJob(
                GetTranscriptionJobRequest.builder()
                        .transcriptionJobName(jobName)
                        .build()
        );

        String transcriptUrl = response.transcriptionJob().transcript().transcriptFileUri();

        try (InputStream in = new URL(transcriptUrl).openStream()) {
            JsonNode json = objectMapper.readTree(in);
            return json.at("/results/transcripts/0/transcript").asText();
        }
    } catch (TranscribeException | java.io.IOException e) {
        log.error("Failed to fetch transcript for job {}", jobName, e);
        throw new TranscribeRepositoryException("Failed to fetch transcript", e);
    }
}
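The cleanup step from the list above isn’t shown in the snippets; one way to implement it is a Spring @Scheduled task that deletes finished jobs and their audio files. This is a rough sketch under assumptions of my own, in particular the completedJobs bookkeeping map (job name to S3 key) and the hourly interval:

// Hypothetical scheduled task; completedJobs (jobName -> s3Key) and the
// fixed hourly rate are illustrative assumptions.
@Scheduled(fixedRate = 3_600_000)
public void cleanUpCompletedJobs() {
    completedJobs.forEach((jobName, s3Key) -> {
        // Remove the finished transcription job...
        transcribeClient.deleteTranscriptionJob(
                DeleteTranscriptionJobRequest.builder()
                        .transcriptionJobName(jobName)
                        .build());

        // ...and the uploaded audio file backing it.
        s3Client.deleteObject(DeleteObjectRequest.builder()
                .bucket(bucketName)
                .key(s3Key)
                .build());
    });
    completedJobs.clear();
}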
Live demo: See it in action
To demonstrate the backend in a real-world scenario, I’ve built a simple frontend application, available at https://talkscribe.org. Written in plain HTML, CSS, and JavaScript, it offers an intuitive UI for:
- Typing in text and listening to it spoken aloud via TTS.
- Recording audio and receiving transcriptions using STT.
The frontend communicates with the backend API through stateless HTTP requests, making it easy to understand and extend. It serves as a working reference for teams looking to integrate similar accessibility features into their applications.
Conclusion
By combining AWS’s voice services with a well-structured Java backend, this project delivers meaningful accessibility improvements in line with WCAG standards. The architecture is modular and cloud-native, making it easy to maintain, expand, or adapt to other technologies in the future.
If you’d like to dive deeper into the implementation or reuse it in your projects, the full backend code is available as open source on GitHub.
Whether you’re building for accessibility, innovation, or both, this kind of integration is a meaningful step forward. Keep building with empathy and don’t forget the power of voice.
#wcag #ai #api #integration
Danijel Dragičević
Software Engineer
Danijel Dragičević is a software developer and content creator who has been part of our family since April 2014. With a strong background in backend development, he has spent the past few years specializing in building robust services for API integrations. Passionate about clean code and efficient workflows, he continuously explores new technologies to enhance development processes.