
Integrating TTS and STT in Java for WCAG accessibility

Danijel Dragičević

Software Engineer

What WCAG is and why we should care

As digital technology continues to shape how we live and work, making digital content accessible to everyone is no longer optional – it’s essential. The Web Content Accessibility Guidelines (WCAG), developed by the W3C, provide a clear framework for making web content accessible to people with disabilities.

Structured around four core principles – Perceivable, Operable, Understandable, and Robust (POUR) – WCAG helps development teams build more inclusive and legally compliant applications. Among its many recommendations, WCAG emphasizes the importance of offering content in multiple modalities, such as text alternatives for audio and audio alternatives for text.

This is exactly where Text-to-Speech (TTS) and Speech-to-Text (STT) technologies come into play.

Why voice accessibility matters

Voice-based interactions offer tremendous value to users with visual, motor, or cognitive disabilities. TTS reads content aloud, supporting users who struggle with reading or visual processing. On the other hand, STT allows users to speak commands or content, which is especially useful for those with difficulty typing or using touch interfaces.

Integrating speech functionality into our applications goes beyond simply meeting accessibility requirements. It creates meaningful, everyday improvements in user experience. Consider scenarios like:

  • Reading assistance for users with dyslexia, vision loss, or cognitive disabilities who rely on TTS to consume web content.
  • Hands-free interaction in environments where typing isn’t practical, whether for users with motor impairments or professionals on the move.
  • Voice-powered form filling, note-taking, or messaging for users who prefer or need to speak instead of type.
  • Multi-language applications that detect spoken or written input in various languages and respond accordingly.

To bring these capabilities into a real-world application, I developed a Java Spring Boot backend that integrates with several AWS services: Amazon Polly for TTS, Amazon Transcribe for STT, Amazon Comprehend for language detection, and Amazon S3 for audio file storage.

Let’s walk through how this backend is designed and how each piece fits together to support WCAG-aligned voice accessibility.

How the backend works

Text-to-Speech (TTS) flow

When a user submits text that should be read aloud, the backend performs the following steps:

  1. Language Detection – The input text is analyzed using Amazon Comprehend to determine the most dominant language.
  2. Voice Selection – Based on the detected language, a matching voice is selected using internal logic (e.g., Joanna for English, Marlene for German); a sketch of this mapping follows the Comprehend snippet below.
  3. Speech Synthesis – The text and selected voice are sent to Amazon Polly, which returns a high-quality MP3 audio stream.
  4. Response Streaming – This audio is then streamed back to the client for immediate playback.

Here’s the repository logic used to detect language with Comprehend:

public String detectLanguage(String text) throws ComprehendRepositoryException {
    try {
        DetectDominantLanguageRequest request = DetectDominantLanguageRequest.builder()
                .text(text)
                .build();

        DetectDominantLanguageResponse response = comprehendClient.detectDominantLanguage(request);

        if (!response.languages().isEmpty()) {
            return response.languages().get(0).languageCode();
        } else {
            return "en"; // Default to English if no languages are detected
        }

    } catch (ComprehendException e) {
        log.error("AWS Comprehend error while detecting language", e);
        throw new ComprehendRepositoryException("AWS Comprehend error", e);
    } catch (Exception e) {
        log.error("Unexpected error accessing Comprehend service", e);
        throw new ComprehendRepositoryException("Error accessing Comprehend service", e);
    }
}
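Step 2, voice selection, is plain application logic rather than an AWS call. Here’s a minimal sketch of what that mapping might look like; the PollyVoice record and the exact language-to-voice pairs are illustrative assumptions, not the project’s actual code:

// Hypothetical mapping from a detected language code to a Polly voice and locale.
// Only two entries are shown; Polly supports many more languages and voices.
public record PollyVoice(String voiceId, String localeCode) {}

private static final Map<String, PollyVoice> VOICE_BY_LANGUAGE = Map.of(
        "en", new PollyVoice("Joanna", "en-US"),
        "de", new PollyVoice("Marlene", "de-DE")
);

public PollyVoice selectVoice(String detectedLanguageCode) {
    // Fall back to the English voice when no mapping exists, mirroring the
    // "en" default used in detectLanguage above.
    return VOICE_BY_LANGUAGE.getOrDefault(detectedLanguageCode, VOICE_BY_LANGUAGE.get("en"));
}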

And here’s how we send the text to Polly for speech synthesis:

public InputStream convertTextToSpeech(String text, String pollyVoiceId, String pollyLocaleCode) throws PollyRepositoryException {
    try {
        SynthesizeSpeechRequest request = SynthesizeSpeechRequest.builder()
                .text(text)
                .voiceId(VoiceId.fromValue(pollyVoiceId))
                .languageCode(LanguageCode.fromValue(pollyLocaleCode))
                .outputFormat(OutputFormat.MP3)
                .engine(Engine.NEURAL)
                .build();

        return pollyClient.synthesizeSpeech(request);
    } catch (PollyException e) {
        log.error("AWS Polly error while converting text to speech", e);
        throw new PollyRepositoryException("AWS Polly error", e);
    } catch (Exception e) {
        log.error("Unexpected error accessing Polly service", e);
        throw new PollyRepositoryException("Error accessing Polly service", e);
    }
}

By leveraging Polly’s neural engine and multilingual voice support, the system produces natural, localized speech output suitable for a wide range of users.
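To round off the flow, here’s a hypothetical Spring controller showing how the MP3 stream from step 4 could be returned to the client. The route, request shape, and TtsService abstraction are assumptions for illustration; the actual API may be shaped differently:

// Hypothetical controller illustrating step 4 (Response Streaming).
@RestController
@RequestMapping("/api/tts")
public class TtsController {

    private final TtsService ttsService; // assumed to chain Comprehend -> voice selection -> Polly

    public TtsController(TtsService ttsService) {
        this.ttsService = ttsService;
    }

    @PostMapping(produces = "audio/mpeg")
    public ResponseEntity<InputStreamResource> synthesize(@RequestBody String text) {
        // The service returns Polly's MP3 stream, which is passed straight through to the client.
        InputStream audioStream = ttsService.synthesize(text);
        return ResponseEntity.ok()
                .contentType(MediaType.valueOf("audio/mpeg"))
                .body(new InputStreamResource(audioStream));
    }
}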

Speech-to-Text (STT) flow

The reverse workflow is just as smooth. When a user uploads an audio recording, the backend processes it as follows:

  1. Upload to S3 – The audio file is stored in a secure Amazon S3 bucket.
  2. Start Transcription – The S3 URL is passed to Amazon Transcribe, which launches an asynchronous transcription job.
  3. Check Job Status – Clients can periodically check on the job status via a unique identifier. Once the job completes, the transcript becomes available.
  4. Clean Up – A scheduled background task periodically removes completed jobs and associated files from S3 to free up resources; a sketch of this task appears after the transcript snippet below.

Here’s a snippet showing how we upload the audio file:

public String uploadAudioFile(MultipartFile audioFile) throws S3RepositoryException {
    String key = "audio-" + UUID.randomUUID() + ".mp3";

    PutObjectRequest putRequest = PutObjectRequest.builder()
            .bucket(bucketName)
            .key(key)
            .contentType(audioFile.getContentType())
            .build();

    try (InputStream inputStream = audioFile.getInputStream()) {
        s3Client.putObject(putRequest, RequestBody.fromInputStream(inputStream, audioFile.getSize()));
    } catch (S3Exception e) {
        log.error("Error uploading file to S3", e);
        throw new S3RepositoryException("Error uploading file to S3", e);
    } catch (Exception e) {
        log.error("Unexpected error uploading file to S3", e);
        throw new S3RepositoryException("Unexpected error uploading file to S3", e);
    }
    return key;
}

And this is how we start a transcription job with Amazon Transcribe:

public String startTranscriptionJob(String s3Key) throws TranscribeRepositoryException {
    String jobName = "job-" + UUID.randomUUID();

    Media media = Media.builder()
            .mediaFileUri("s3://" + bucketName + "/" + s3Key)
            .build();

    StartTranscriptionJobRequest request = StartTranscriptionJobRequest.builder()
            .transcriptionJobName(jobName)
            .mediaFormat(MediaFormat.MP3)
            .media(media)
            .identifyLanguage(true)
            .languageOptions(LanguageCode.EN_US, LanguageCode.DE_DE, ...)
            .build();

    try {
        transcribeClient.startTranscriptionJob(request);
        return jobName;
    } catch (TranscribeException e) {
        log.error("Failed to start transcription job", e);
        throw new TranscribeRepositoryException("Failed to start transcription job", e);
    }
}

Once the job completes, the transcript can be retrieved like this:

public String fetchTranscript(String jobName) throws TranscribeRepositoryException {
    try {
        GetTranscriptionJobResponse response = transcribeClient.getTranscriptionJob(
                GetTranscriptionJobRequest.builder()
                        .transcriptionJobName(jobName)
                        .build()
        );

        String transcriptUrl = response.transcriptionJob().transcript().transcriptFileUri();

        try (InputStream in = new URL(transcriptUrl).openStream()) {
            JsonNode json = objectMapper.readTree(in);
            return json.at("/results/transcripts/0/transcript").asText();
        }
    } catch (TranscribeException | java.io.IOException e) {
        log.error("Failed to fetch transcript for job {}", jobName, e);
        throw new TranscribeRepositoryException("Failed to fetch transcript", e);
    }
}
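The cleanup task from step 4 isn’t shown above, so here’s a minimal sketch of how it might look. The trackedJobs map and the one-hour interval are assumptions; the real project may track job state and S3 keys differently:

// Minimal sketch of the scheduled cleanup from step 4.
private final Map<String, String> trackedJobs = new ConcurrentHashMap<>(); // jobName -> S3 key

@Scheduled(fixedRate = 3_600_000) // run once per hour (hypothetical interval)
public void cleanUpCompletedJobs() {
    trackedJobs.forEach((jobName, s3Key) -> {
        GetTranscriptionJobResponse response = transcribeClient.getTranscriptionJob(
                GetTranscriptionJobRequest.builder()
                        .transcriptionJobName(jobName)
                        .build());

        TranscriptionJobStatus status = response.transcriptionJob().transcriptionJobStatus();
        if (status == TranscriptionJobStatus.COMPLETED || status == TranscriptionJobStatus.FAILED) {
            // Delete the finished transcription job and its source audio from S3.
            transcribeClient.deleteTranscriptionJob(DeleteTranscriptionJobRequest.builder()
                    .transcriptionJobName(jobName)
                    .build());
            s3Client.deleteObject(DeleteObjectRequest.builder()
                    .bucket(bucketName)
                    .key(s3Key)
                    .build());
            trackedJobs.remove(jobName);
        }
    });
}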

Live demo: See it in action

To demonstrate the backend in a real-world scenario, I’ve built a simple frontend application, available at https://talkscribe.org. Written in plain HTML, CSS, and JavaScript, it offers an intuitive UI for:

  • Typing in text and listening to it spoken aloud via TTS.
  • Recording audio and receiving transcriptions using STT.

The frontend communicates with the backend API through stateless HTTP requests, making it easy to understand and extend. It serves as a working reference for teams looking to integrate similar accessibility features into their applications.

Conclusion

By combining AWS’s voice services with a well-structured Java backend, this project delivers meaningful accessibility improvements in line with WCAG standards. The architecture is modular and cloud-native, making it easy to maintain, expand, or adapt to other technologies in the future.

If you’d like to dive deeper into the implementation or reuse it in your projects, the full backend code is available as open source on GitHub.

Whether you’re building for accessibility, innovation, or both, this kind of integration is a meaningful step forward. Keep building with empathy and don’t forget the power of voice.

#wcag #ai #api #integration


Danijel Dragičević

Software Engineer

Danijel Dragičević is a software developer and content creator who has been part of our family since April 2014. With a strong background in backend development, he has spent the past few years specializing in building robust services for API integrations. Passionate about clean code and efficient workflows, he continuously explores new technologies to enhance development processes.

