29 July 2025 • 3-minute read
Integrating TTS and STT in Java for WCAG accessibility
Danijel Dragičević
Software Engineer
What is WCAG and why we should care
As digital technology continues to shape how we live and work, making digital content accessible to everyone is no longer optional – it’s essential. The Web Content Accessibility Guidelines (WCAG), developed by the W3C, provide a clear framework for making web content accessible to people with disabilities.
Structured around four core principles – Perceivable, Operable, Understandable, and Robust (POUR) – WCAG helps development teams build more inclusive and legally compliant applications. Among its many recommendations, WCAG emphasizes the importance of offering content in multiple modalities, such as text alternatives for audio and audio alternatives for text.
This is exactly where Text-to-Speech (TTS) and Speech-to-Text (STT) technologies come into play.
Why voice accessibility matters
Voice-based interactions offer tremendous value to users with visual, motor, or cognitive disabilities. TTS reads content aloud, supporting users who struggle with reading or visual processing. On the other hand, STT allows users to speak commands or content, which is especially useful for those with difficulty typing or using touch interfaces.
Integrating speech functionality into our applications goes beyond simply meeting accessibility requirements. It creates meaningful, everyday improvements in user experience. Consider scenarios like:
- Reading assistance for users with dyslexia, vision loss, or cognitive disabilities who rely on TTS to consume web content.
- Hands-free interaction in environments where typing isn’t practical, like users with motor impairments or professionals on the move.
- Voice-powered form filling, note-taking, or messaging for users who prefer or need to speak instead of type.
- Multi-language applications that detect spoken or written input in various languages and respond accordingly.
To bring these capabilities into a real-world application, I developed a Java Spring Boot backend that integrates with several AWS services: Amazon Polly for TTS, Amazon Transcribe for STT, Amazon Comprehend for language detection, and Amazon S3 for audio file storage.
Let’s walk through how this backend is designed and how each piece fits together to support WCAG-aligned voice accessibility.
How the backend works
Text-to-Speech (TTS) flow
When a user submits text that should be read aloud, the backend performs the following steps:
- Language Detection – The input text is analyzed using Amazon Comprehend to determine the most dominant language.
- Voice Selection – Based on the detected language, a matching voice is selected using internal logic (e.g., Joanna for English, Marlene for German).
- Speech Synthesis – The text and selected voice are sent to Amazon Polly, which returns a high-quality MP3 audio stream.
- Response Streaming – This audio is then streamed back to the client for immediate playback.
Here’s the repository logic used to detect language with Comprehend:
public String detectLanguage(String text) throws ComprehendRepositoryException {
    try {
        DetectDominantLanguageRequest request = DetectDominantLanguageRequest.builder()
                .text(text)
                .build();

        DetectDominantLanguageResponse response = comprehendClient.detectDominantLanguage(request);

        if (!response.languages().isEmpty()) {
            return response.languages().get(0).languageCode();
        } else {
            return "en"; // Default to English if no languages are detected
        }

    } catch (ComprehendException e) {
        log.error("AWS Comprehend error while detecting language", e);
        throw new ComprehendRepositoryException("AWS Comprehend error", e);
    } catch (Exception e) {
        log.error("Unexpected error accessing Comprehend service", e);
        throw new ComprehendRepositoryException("Error accessing Comprehend service", e);
    }
}
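Step 2, voice selection, is plain application logic rather than an AWS call, so it isn't shown in the repository snippets. A minimal sketch of what that mapping could look like (the class name `VoiceSelector` and the two-entry map are illustrative assumptions; the real project likely covers more languages):

```java
import java.util.Map;

/** Maps a detected ISO language code to a Polly voice name and locale. */
public final class VoiceSelector {

    // Hypothetical mapping: language code -> { voice id, Polly locale code }.
    private static final Map<String, String[]> VOICES = Map.of(
            "en", new String[] {"Joanna", "en-US"},
            "de", new String[] {"Marlene", "de-DE"});

    public static String[] voiceFor(String languageCode) {
        // Fall back to the English voice when the detected language has no mapping.
        return VOICES.getOrDefault(languageCode, VOICES.get("en"));
    }
}
```

The selected pair is then passed straight into the Polly call shown below as `pollyVoiceId` and `pollyLocaleCode`.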
And here’s how we send the text to Polly for speech synthesis:
public InputStream convertTextToSpeech(String text, String pollyVoiceId, String pollyLocaleCode) throws PollyRepositoryException {
    try {
        SynthesizeSpeechRequest request = SynthesizeSpeechRequest.builder()
                .text(text)
                .voiceId(VoiceId.fromValue(pollyVoiceId))
                .languageCode(LanguageCode.fromValue(pollyLocaleCode))
                .outputFormat(OutputFormat.MP3)
                .engine(Engine.NEURAL)
                .build();

        return pollyClient.synthesizeSpeech(request);
    } catch (PollyException e) {
        log.error("AWS Polly error while converting text to speech", e);
        throw new PollyRepositoryException("AWS Polly error", e);
    } catch (Exception e) {
        log.error("Unexpected error accessing Polly service", e);
        throw new PollyRepositoryException("Error accessing Polly service", e);
    }
}
By leveraging Polly’s neural engine and multilingual voice support, the system produces natural, localized speech output suitable for a wide range of users.
Speech-to-Text (STT) flow
The reverse workflow is just as smooth. When a user uploads an audio recording, the backend processes it as follows:
- Upload to S3 – The audio file is stored in a secure Amazon S3 bucket.
- Start Transcription – The S3 URL is passed to Amazon Transcribe, which launches an asynchronous transcription job.
- Check Job Status – Clients can periodically check on the job status via a unique identifier. Once the job completes, the transcript becomes available.
- Clean Up – A scheduled background task periodically removes completed jobs and associated files from S3 to free up resources.
Here’s a snippet showing how we upload the audio file:
public String uploadAudioFile(MultipartFile audioFile) throws S3RepositoryException {
    String key = "audio-" + UUID.randomUUID() + ".mp3";

    PutObjectRequest putRequest = PutObjectRequest.builder()
            .bucket(bucketName)
            .key(key)
            .contentType(audioFile.getContentType())
            .build();

    try (InputStream inputStream = audioFile.getInputStream()) {
        s3Client.putObject(putRequest, RequestBody.fromInputStream(inputStream, audioFile.getSize()));
    } catch (S3Exception e) {
        log.error("Error uploading file to S3", e);
        throw new S3RepositoryException("Error uploading file to S3", e);
    } catch (Exception e) {
        log.error("Unexpected error uploading file to S3", e);
        throw new S3RepositoryException("Unexpected error uploading file to S3", e);
    }
    return key;
}
And this is how we start a transcription job with Amazon Transcribe:
public String startTranscriptionJob(String s3Key) throws TranscribeRepositoryException {
    String jobName = "job-" + UUID.randomUUID();

    Media media = Media.builder()
            .mediaFileUri("s3://" + bucketName + "/" + s3Key)
            .build();

    StartTranscriptionJobRequest request = StartTranscriptionJobRequest.builder()
            .transcriptionJobName(jobName)
            .mediaFormat(MediaFormat.MP3)
            .media(media)
            .identifyLanguage(true)
            .languageOptions(LanguageCode.EN_US, LanguageCode.DE_DE, ...)
            .build();

    try {
        transcribeClient.startTranscriptionJob(request);
        return jobName;
    } catch (TranscribeException e) {
        log.error("Failed to start transcription job", e);
        throw new TranscribeRepositoryException("Failed to start transcription job", e);
    }
}
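For step 3, the client polls until the job reaches a terminal state; in the AWS SDK that means comparing `response.transcriptionJob().transcriptionJobStatus()` against `COMPLETED` or `FAILED`. The decision itself can be sketched as a small pure helper (the class name `JobStatus` is hypothetical, not taken from the project):

```java
import java.util.Set;

/** Decides whether an Amazon Transcribe job status string is terminal. */
public final class JobStatus {

    // Transcribe reports IN_PROGRESS / QUEUED while running; these two end the poll loop.
    private static final Set<String> TERMINAL = Set.of("COMPLETED", "FAILED");

    public static boolean isTerminal(String status) {
        return TERMINAL.contains(status);
    }
}
```

A polling endpoint would call this on each status check and only attempt to fetch the transcript once it returns true for a `COMPLETED` job.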
Once the job completes, the transcript can be retrieved like this:
public String fetchTranscript(String jobName) throws TranscribeRepositoryException {
    try {
        GetTranscriptionJobResponse response = transcribeClient.getTranscriptionJob(
                GetTranscriptionJobRequest.builder()
                        .transcriptionJobName(jobName)
                        .build()
        );

        String transcriptUrl = response.transcriptionJob().transcript().transcriptFileUri();

        try (InputStream in = new URL(transcriptUrl).openStream()) {
            JsonNode json = objectMapper.readTree(in);
            return json.at("/results/transcripts/0/transcript").asText();
        }
    } catch (TranscribeException | java.io.IOException e) {
        log.error("Failed to fetch transcript for job {}", jobName, e);
        throw new TranscribeRepositoryException("Failed to fetch transcript", e);
    }
}
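The clean-up step (step 4) isn't shown in the snippets above. The scheduling itself would typically be a Spring `@Scheduled` task that deletes finished jobs and their S3 objects; the retention decision at its core can be sketched like this (the one-hour window and the class name `CleanupPolicy` are assumptions, not taken from the project):

```java
import java.time.Duration;
import java.time.Instant;

/** Decides whether a completed job's artifacts are old enough to delete. */
public final class CleanupPolicy {

    // Assumed retention window; tune to how long clients may still poll for results.
    private static final Duration RETENTION = Duration.ofHours(1);

    public static boolean shouldDelete(Instant completedAt, Instant now) {
        // Delete once the job has been finished for at least the retention window.
        return Duration.between(completedAt, now).compareTo(RETENTION) >= 0;
    }
}
```

Keeping the policy separate from the scheduled task makes the deletion rule easy to unit-test without touching AWS.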
Live demo: See it in action
To demonstrate the backend in a real-world scenario, I've created a simple frontend application, available at https://talkscribe.org. Built with plain HTML, CSS, and JavaScript, it offers an intuitive UI for:
- Typing in text and listening to it spoken aloud via TTS.
- Recording audio and receiving transcriptions using STT.
The frontend communicates with the backend API through stateless HTTP requests, making it easy to understand and extend. It serves as a working reference for teams looking to integrate similar accessibility features into their applications.
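Because the API is stateless, each interaction is just a plain HTTP request. Using only the JDK's `java.net.http` classes, a TTS call could be assembled like this sketch (the `/api/tts` path and the JSON shape are assumptions for illustration; the article doesn't document the real routes):

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Builds a stateless JSON request to a hypothetical /api/tts endpoint. */
public final class TtsRequestFactory {

    public static HttpRequest ttsRequest(String baseUrl, String text) {
        // Naive JSON assembly for brevity; real code would escape the text properly.
        String body = "{\"text\":\"" + text + "\"}";
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/tts"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

Sending the request with `HttpClient.newHttpClient().send(...)` and streaming the MP3 response to an `<audio>` element (or a file) is all a client needs; no session state is involved.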
Conclusion
By combining AWS’s voice services with a well-structured Java backend, this project delivers meaningful accessibility improvements in line with WCAG standards. The architecture is modular and cloud-native, making it easy to maintain, expand, or adapt to other technologies in the future.
If you’d like to dive deeper into the implementation or reuse it in your projects, the full backend code is available as open source on GitHub:
Whether you’re building for accessibility, innovation, or both, this kind of integration is a meaningful step forward. Keep building with empathy and don’t forget the power of voice.
#wcag #ai #api #integration
Danijel Dragičević
Software Engineer
Danijel Dragičević is a software developer and content creator who has been part of our family since April 2014. With a strong background in backend development, he has spent the past few years specializing in building robust services for API integrations. Passionate about clean code and efficient workflows, he continuously explores new technologies to enhance development processes.