Cloud-based speech-to-text and text-to-speech pricing – it's all about the volume

June 29, 2021
by Jean Atelsek


Basic transcription and voice services from Amazon, Azure, Google and IBM can be incorporated into applications ranging from the simple to the very complex. Pricing models show that scale matters most, with free tiers being the primary cost factor in all but the largest deployments. For smaller jobs, ISVs have packaged popular use cases into managed services, or do-it-yourselfers can build functionality at little or no expense.

The 451 Take

Speech-to-text and text-to-speech services have come a long way in the past five years – it's not unusual to see near-real-time transcription and even translation for live online presentations, and apps and widgets for 'speaking' from web pages or text files are readily available. The major cloud providers continue to add voices, languages and domain-specific vocabularies to their portfolios, giving software vendors the building blocks to create more sophisticated programs for education, customer service, gaming and other applications. The pricing models are simple, with the greatest variation coming in at lower levels of usage (under 1,500 hours of speech or four million characters of text) thanks to free tiers.

Use cases and caveats

While this report looks at pricing and features of basic speech transcription and text-to-voice simulations for four major cloud providers, software vendors have long been incorporating these technologies into higher-level services. Rudimentary speech-to-text and text-to-speech capabilities have been layered with cognitive services such as comprehension, sentiment analysis and language translation to power virtual assistants (such as Amazon Alexa and Google Home), interactive voice response systems for customer service and media services such as closed captioning.

Listening to machine-generated voices for more than a few minutes can be a soul-crushing experience. Fortunately, all four providers have a premium text-to-speech tier with more natural-sounding voices, typically at 4x the cost of standard ones. In addition, speech markup standards such as Speech Synthesis Markup Language (SSML) can be used to customize voices and add nuance to speaking styles.
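As a rough illustration of the kind of tuning SSML allows, the sketch below assembles a small SSML document with Python's standard library. The `speak`, `prosody` and `break` elements come from the W3C SSML specification; the helper name `build_ssml` and its parameters are hypothetical, and the elements and attribute values each provider accepts should be checked against its own documentation.

```python
import xml.etree.ElementTree as ET


def build_ssml(sentences, rate="medium", pause_ms=400):
    """Wrap plain sentences in a basic SSML document.

    Hypothetical helper: each sentence is spoken at the given prosody
    rate, with a pause inserted between consecutive sentences.
    """
    speak = ET.Element("speak")
    for index, sentence in enumerate(sentences):
        if index > 0:
            # <break> inserts a pause before every sentence after the first.
            ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
        prosody = ET.SubElement(speak, "prosody", {"rate": rate})
        prosody.text = sentence  # ElementTree escapes &, < and > on output
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml(["Welcome back.", "Your order has shipped."], rate="slow")
print(ssml)
```

The resulting string would then be passed to a provider's synthesis request in place of plain text; that call is provider-specific and is deliberately left out here.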

Voice quality and transcription accuracy may be critical for, say, a customer service system or a medical 'digital scribe,' so assessing them is an important step in evaluating any such system. Some providers make it easy to superficially check these features with web-based demos. Custom dictionaries and machine learning capabilities can improve accuracy over time, but they require humans in the loop to assist with training the model. Google is transparent about leveraging customer input to improve transcription accuracy: its speech-to-text API charges 33% less per 15-second interval ($0.004 versus $0.006) if you agree to give it access to your data to train and improve transcription quality. This openness is admirable: Amazon's and Microsoft's license terms state that the companies may collect such data by default, and it's up to the customer to opt out (although there are exceptions; e.g., no such collection is enabled with AWS's Transcribe Medical service).
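Transcription accuracy is commonly quantified as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the machine transcript into a human reference, divided by the number of reference words. The report does not prescribe a metric, so the following is only a minimal sketch of that general technique; the function name and the simple lowercase, whitespace-based tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the patient has a fever", "the patient has fever"))
```

Running the same set of reference recordings through each candidate service and comparing the scores gives a like-for-like check that goes beyond a web demo.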

TL;DR: Your mileage may vary; caveat emptor.

Speech-to-text: Amazon costs most for small jobs, least for large ones

In a world where speech makes up a good portion of many media consumption diets – think podcasts, webinars, earnings calls, lectures – transcribing spoken audio into text can capture information in a way that makes it readily storable, searchable and analyzable. The major cloud providers themselves are building or buying more elaborate software based on these services, largely in the form of media services for transcribing, translating and subtitling video feeds. This still leaves plenty of room for ISVs such as Otter.ai, Rev and Temi to offer quality control for general transcription; other vendors add value based on expertise in medical, legal, financial or other vertical domains.

Table 1 summarizes features and pricing for basic speech-to-text offerings from Amazon, Microsoft Azure, Google and IBM. All have copious documentation and in some cases code samples and training to help customers incorporate this capability into cloud-based workflows.

Table 1: Speech-to-Text Pricing and Features for Amazon, Azure, Google and IBM

| | Amazon Transcribe | Azure Speech to Text | Google Speech-to-Text API | IBM Watson Speech to Text |
| --- | --- | --- | --- | --- |
| How charged | Seconds transcribed/month | Audio hours/month | 15-second intervals/month | Minutes per month |
| Free tier | 60 audio minutes monthly for the first 12 months | Five audio hours per month | 60 audio minutes per month | 500 minutes per month |
| Pricing | Four tiers based on volume, ranging from $0.024 per minute to $0.0078 per minute (for US East; pricing varies by region) | $1 per audio hour | $0.004 per 15-second interval with data logging, $0.006 without data logging | $0.02 per minute up to one million minutes per month, $0.01 per minute thereafter |
| Premium or specialized options | Transcribe Medical | | Enhanced models for video and phone calls | Premium Plan with greater capacity and security |
| Regions | Four US regions, Asia-Pacific (Hong Kong, Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London, Paris), Middle East (Bahrain), South America (Sao Paulo), AWS GovCloud (US-East, US-West) | Eight US regions, Asia-Pacific (Hong Kong, India, Korea, Singapore, Australia, Japan), Canada, Europe (UK, Switzerland, France, Ireland, Netherlands), South Africa, Brazil, US Gov (Arizona, Texas, Virginia) | All Google regions; as of January 2021, the service supports regional endpoints (in preview) in the US and EU for data sovereignty | Sydney, Frankfurt, London, Tokyo, Washington DC, Dallas, Seoul |
| Languages supported | Streaming and batch: Chinese, English (Australian, British, US), French (French, Canadian), German, Italian, Japanese, Korean, Brazilian Portuguese, US Spanish. Batch only: Arabic (Gulf, Modern Standard), English (Indian, Irish, Scottish, Welsh), Spanish, Farsi Persian, Swiss German, Hebrew, Indian Hindi, Indonesian, Malay, Portuguese, Russian, Tamil, Telugu, Turkish | 153 languages and variants including Arabic, Catalan, Chinese (Cantonese and Mandarin), Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Thai | 143 languages and variants including Chinese, Czech, Danish, English, French, German, Indonesian, Italian, Korean, Spanish, Turkish | Brazilian Portuguese, Chinese (Mandarin dialect), Dutch, English (US and UK dialects), French, German, Italian, Japanese, Korean, Spanish (Argentinian, Castilian, Chilean, Colombian, Mexican, and Peruvian dialects), and Modern Standard Arabic (broadband model only) |

Source: 451 Research

The pricing regimes are quite different even for ostensibly similar services. AWS offers a free tier of 60 minutes per month, but only for the first 12 months. Google also starts charging after 60 minutes, whereas Microsoft and IBM have monthly allowances of five hours and 500 minutes (just over eight hours), respectively. In addition to the free audio time, Microsoft throws in endpoint hosting for one custom model per month. Given their high request limits, we suspect that AWS (up to 250 concurrent requests for batch jobs) and Google (up to 900 requests per minute) are the most popular choices for third parties building products based on these services.

In terms of monthly cost, Amazon's service is the priciest at smaller volumes but due to its tiered pricing becomes cheapest at higher volumes. The usage level at which the total cost crosses over is quite high, however (see Figure 1): Amazon's second tier, which drops the per-minute cost by 37.5%, from $0.024 to $0.015, kicks in at 250,000 minutes – over 4,000 hours per month. The monthly cost for all the services remains below $100 until more than 70 hours of audio has been transcribed.
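The $100 figure can be checked by inverting the cost formula: the audio a budget buys is the budget divided by the per-minute rate, plus the free allowance. The sketch below does this with the list prices in Table 1. It assumes Amazon's first-tier US East rate and Google's rate without data logging, and it ignores Google's rounding to 15-second intervals; the function and variable names are illustrative.

```python
def hours_for_budget(budget, rate_per_min, free_min):
    """Audio hours per month that a budget buys, free allowance included."""
    return (budget / rate_per_min + free_min) / 60


# (USD per audio minute, free minutes per month), from Table 1.
TERMS = {
    "Amazon": (0.024, 60),        # first tier; free tier lasts 12 months
    "Azure": (1.00 / 60, 5 * 60),  # $1 per audio hour, five free hours
    "Google": (0.006 * 4, 60),     # four 15-second intervals per minute
    "IBM": (0.02, 500),
}

for provider, (rate, free) in TERMS.items():
    hours = hours_for_budget(100, rate, free)
    print(f"{provider:7s} {hours:6.1f} hours for $100")

# Relative size of Amazon's drop from the first to the second tier.
tier_drop = (0.024 - 0.015) / 0.024
print(f"Amazon second-tier reduction: {tier_drop:.1%}")
```

Under these assumptions, $100 buys roughly 70 hours from Amazon or Google, about 92 hours from IBM and 105 hours from Azure.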

Figure 1: Speech-to-Text Monthly Cost vs. Minutes Transcribed (Source: 451 Research)

A few hypothetical use cases show how modest the base cost of these services can be even at higher levels of usage. In the first scenario, let's say a first-year law student wants to capture all the audio from their lectures – 60 hours of audio per month. In the second, consider a call center that receives about 45,000 calls per month averaging two minutes each, for a total of 1,500 hours per month. For comparison, we also include a situation (open to the imagination) where the amount transcribed is 20,000 hours per month. Figure 2 shows the monthly cost for each scenario.

Figure 2: Speech-to-Text Monthly Cost: Three Scenarios (Source: 451 Research)

The relative cost in the small and medium use cases is affected more by the free allowances than by the pricing itself; only beyond the 250,000-minute threshold (more than 4,000 hours) does AWS start to confer a cost advantage. IBM's lower cost tier of $0.01 versus $0.02 kicks in at one million minutes (over 16,000 hours), but at volumes over five million minutes (over 83,000 hours) per month, AWS's unit pricing is cheaper. At these levels of spending, some if not all of the providers would likely be cutting deals with customers.
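To make the effect of the free allowances concrete, the following sketch computes list-price monthly costs for the law-student (60 hours) and call-center (1,500 hours) scenarios from the Table 1 figures. It is an approximation under stated assumptions: Amazon's first-tier US East rate (both scenarios fall below 250,000 minutes), Google shown with and without data logging, and no rounding to billing increments. The 20,000-hour scenario is left out because it depends on Amazon's intermediate tier rates, which this report does not list. All names are illustrative.

```python
def monthly_cost(minutes, rate_per_min, free_min):
    """Monthly charge at a flat per-minute rate after the free allowance."""
    return max(0.0, minutes - free_min) * rate_per_min


# (USD per audio minute, free minutes per month), from Table 1.
PROVIDERS = {
    "Amazon": (0.024, 60),                  # free tier: first 12 months only
    "Azure": (1.00 / 60, 5 * 60),           # $1 per audio hour
    "Google (no logging)": (0.006 * 4, 60),  # four 15-second intervals/minute
    "Google (logging)": (0.004 * 4, 60),
    "IBM": (0.02, 500),
}

SCENARIOS = {"law student": 60, "call center": 1500}  # audio hours per month

for scenario, hours in SCENARIOS.items():
    for provider, (rate, free) in PROVIDERS.items():
        cost = monthly_cost(hours * 60, rate, free)
        print(f"{scenario:12s} {provider:20s} ${cost:>9,.2f}")
```

For the law student, for example, this gives $55.00 on Azure against $84.96 on Amazon: the gap comes partly from Azure's lower rate and partly from its five free hours.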

Text-to-speech: nearly identical pricing for the big three, with IBM as the outlier

The applicability of text-to-speech services has diminished with the growing capabilities of AI-driven virtual assistants and audio channels for gaming. Still, persistent use cases include making documents accessible to the visually impaired and supporting language-learning applications.

Table 2 summarizes basic text-to-speech pricing and capabilities for Amazon, Azure, Google and IBM.

Table 2: Text-to-Speech Pricing and Features for Amazon, Azure, Google and IBM

| | Amazon Polly | Azure Text to Speech | Google Text-to-Speech API | IBM Watson Text to Speech |
| --- | --- | --- | --- | --- |
| How charged | Per million characters/month | Per million characters/month | Per million characters/month | Per thousand characters |
| Free tier | Five million characters per month for the first 12 months | Five million characters per month | Four million characters per month | 10,000 characters per month |
| Pricing (per unit charged) | $4 (Standard), $16 (Neural voices) | $4 (Standard), $16 (Neural) | $4 (Standard), $16 (WaveNet) | $0.02 (Standard); Premium tier also available, 'Call for pricing' |
| Custom voice options | Brand voices (customized), Bilingual voices | Custom voices | Custom voices (in beta) | Custom voices (Premium tier only) |
| Regions | Four US regions, Asia-Pacific (Hong Kong, Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), China (Ningxia), Europe (Frankfurt, Ireland, London, Paris, Stockholm), Middle East (Bahrain), South America (Sao Paulo), AWS GovCloud (US-West) | Eight US regions, Asia-Pacific (Hong Kong, India, Korea, Singapore, Australia, Japan), Canada, Europe (UK, Switzerland, France, Ireland, Netherlands), South Africa, Brazil, US Gov (Arizona, Texas, Virginia) | All Google regions | US (Dallas and Washington, DC), Asia-Pacific (Seoul, Sydney, Tokyo), Europe (Frankfurt, London) |
| Voices and languages supported | 64 voices speaking 29 languages and variants including neural voices for English, French, Korean, Portuguese, Spanish | 163 voices speaking 70 languages and variants (including Welsh), all available as neural | 285 voices speaking 48 languages and variants including WaveNet voices for Arabic, Bengali, Czech, Danish, Dutch, English, Filipino, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Malayalam, Mandarin Chinese, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Tamil, Turkish, Vietnamese | 35 voices speaking 16 languages and variants, all available as Neural: Arabic, Brazilian Portuguese, Chinese (Mandarin dialect), Dutch, English (US and UK dialects), French, German, Italian, Japanese, Korean, and Spanish (Castilian, Latin American, and North American dialects) |

Source: 451 Research

How much is four million to five million characters? In English, according to Wordcounter.net, the King James Bible clocks in at 3,116,480 characters, so the free allowances for the standard offerings from AWS, Azure and Google are quite generous. For these vendors, the free tiers for the more expensive neural voice or WaveNet offerings are much lower: one million characters for Amazon and Google and 500,000 for Azure. As with its speech-to-text service, Microsoft provides free hosting for one custom model per month.

IBM's basic Watson text-to-speech offering has the smallest free tier – 10,000 characters per month, or about 3.5 single-spaced pages – and at $0.02 per 1,000 characters it is more expensive ($20 versus $16 per million characters) than even the premium voices of the big three. But IBM's Standard offering uses all neural voices, which enable more natural-sounding speech, so theoretically it is closer to the premium tiers of the other providers. In all cases, voices can be tuned with SSML. Azure, in keeping with its emphasis on virtual and mixed reality, has extensive documentation on the use of visemes (visual descriptions of the face and mouth positions that correspond to phonemes in a spoken language) to generate facial parameters from input text and animate avatars for customer service, gaming and other applications.
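The text-to-speech comparison reduces to the same arithmetic once IBM's per-thousand price is restated per million characters. The sketch below applies the Table 2 prices and the free allowances quoted above to a single job the size of the King James Bible. It is a list-price approximation, and the function and dictionary names are hypothetical.

```python
def tts_monthly_cost(characters, price_per_million, free_characters):
    """Monthly charge after subtracting the free character allowance."""
    billable = max(0, characters - free_characters)
    return billable / 1_000_000 * price_per_million


# (USD per million characters, free characters per month)
STANDARD = {
    "Amazon Polly": (4.00, 5_000_000),  # free tier: first 12 months only
    "Azure": (4.00, 5_000_000),
    "Google": (4.00, 4_000_000),
}
NEURAL = {
    "Amazon Polly (Neural)": (16.00, 1_000_000),
    "Azure (Neural)": (16.00, 500_000),
    "Google (WaveNet)": (16.00, 1_000_000),
    "IBM Watson": (0.02 * 1000, 10_000),  # $0.02 per thousand = $20 per million
}

KING_JAMES_BIBLE = 3_116_480  # characters, per the Wordcounter.net figure

for label, table in (("standard", STANDARD), ("neural", NEURAL)):
    for provider, (price, free) in table.items():
        cost = tts_monthly_cost(KING_JAMES_BIBLE, price, free)
        print(f"{label:8s} {provider:22s} ${cost:6.2f}")
```

At these volumes every standard voice stays inside its free tier, while the same text costs about $62 with IBM's neural voices and roughly $34 to $42 with the premium voices of the other three.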


Enterprises and independent software vendors with the wherewithal and engineering resources to customize voices and lexicons for speech-to-text or text-to-speech applications have a rich trove of cloud-based services on which to base their offerings. Azure, Google and IBM emphasize the ability to bring these capabilities on-premises, but generous free tiers for cloud-based deployments – including, in Azure's case, free hosting for custom models – make it possible to push the envelope of synthesized speech and speech recognition in the cloud. Azure also gets points for including guidance vis-à-vis 'responsible AI' in its documentation.

This look at speech services focuses strictly on pricing, not quality. Organizations considering such implementations can take advantage of a wealth of open-source projects for measuring the quality of speech recognition and speech synthesis, as well as provider services for tuning results. Managed services are readily available (although at a cost) for cutting through bespoke development and achieving the outcomes needed to take advantage of these technologies.