In this article, I use audio software and some phonetic theory to analyze Mandarin Chinese. I begin with an overview of the basic principles of Mandarin pronunciation.
I discuss three components of the phonological structure of Mandarin Chinese syllables: 声母 (initials), 韵母 (finals), and 尾音 (codas). Initials are the consonant sounds that start a syllable. Finals consist of the main vowel or vowels of a syllable, which may be followed by a coda, the sound at the end of the syllable after the main vowel(s). The coda is a term used mainly in linguistic analysis to describe the syllable’s ending sound; introductory discussions of Mandarin phonetics do not always mention it explicitly.
I use two analysis methods in Audacity: most of the time I analyze the frequency spectrum, and in some places I also apply Enhanced Autocorrelation (EAC) analysis.
Source: https://en.wikibooks.org/wiki/Chinese_(Mandarin)/Table_of_Initial-Final_Combinations
Analysis by frequency
Analysis by EAC
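To make the second method more concrete, here is a minimal Python sketch of pitch estimation by enhanced autocorrelation. It assumes that “EAC” refers to the Enhanced Autocorrelation option in Audacity’s Plot Spectrum window. It is a simplified time-domain version for illustration only, not Audacity’s actual implementation, and the function names and the 60 to 500 Hz search range are illustrative choices.

```python
def autocorrelation(frame, max_lag):
    """Plain (biased) autocorrelation for lags 0 .. max_lag - 1."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag)
    ]


def enhanced_autocorrelation(frame, max_lag):
    """Clip the autocorrelation, subtract a copy stretched by 2, clip again.

    The subtraction suppresses the peaks at twice the true period, so the
    peak at the fundamental period stands out more clearly.
    """
    clipped = [max(v, 0.0) for v in autocorrelation(frame, max_lag)]
    enhanced = []
    for lag in range(max_lag):
        half = lag / 2  # position in the original curve that maps to this lag
        lo = int(half)
        hi = min(lo + 1, max_lag - 1)
        frac = half - lo
        stretched = clipped[lo] * (1 - frac) + clipped[hi] * frac
        enhanced.append(max(clipped[lag] - stretched, 0.0))
    return enhanced


def estimate_pitch(frame, sample_rate, fmin=60.0, fmax=500.0):
    """Return the pitch (Hz) whose lag has the largest enhanced value.

    The frame must be longer than sample_rate / fmin samples.
    """
    max_lag = int(sample_rate / fmin) + 1
    min_lag = int(sample_rate / fmax)
    eac = enhanced_autocorrelation(frame, max_lag)
    best_lag = max(range(min_lag, max_lag), key=lambda lag: eac[lag])
    return sample_rate / best_lag
```

Applied to successive short frames of a syllable, `estimate_pitch` would give a rough pitch track, which is the kind of information the EAC view is used for later in the tone section.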


The images here show the differences between the pronunciation spectra of “University of Waterloo” in English and in Chinese. I generated the English audio with ElevenLabs and the Chinese audio with Narakeet; for consistency, I use these two models throughout the rest of the article.
In English, the name ‘University of Waterloo’ is pronounced as a continuous phrase without clear breaks between words. In contrast, the Pinyin for the Chinese name, ‘huá tiě lú dà xué (滑铁卢大学),’ separates each syllable with a space, reflecting the discrete nature of Chinese characters. The first difference I noticed is that in the Chinese version there is a clear gap, marked by a green line, between each character, whereas the English version has no such gaps.
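The gaps between characters can also be located programmatically. The following is a minimal Python sketch that marks stretches of low short-time energy as gaps. The 10 ms frame, the 1% energy threshold, and the 30 ms minimum gap length are assumptions chosen for illustration and would need tuning on real recordings.

```python
def short_time_energy(samples, frame_len):
    """Mean squared amplitude of consecutive, non-overlapping frames."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_gaps(samples, sample_rate, frame_ms=10, threshold_ratio=0.01, min_gap_ms=30):
    """Return (start_s, end_s) pairs for quiet stretches between sounds.

    A frame counts as quiet when its energy is below threshold_ratio times
    the loudest frame. Leading and trailing silence is ignored.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energy = short_time_energy(samples, frame_len)
    threshold = threshold_ratio * max(energy)
    gaps, start = [], None
    for idx, value in enumerate(energy):
        if value < threshold:
            if start is None:
                start = idx  # a quiet run begins here
        elif start is not None:
            # A quiet run just ended; keep it if it is interior and long enough.
            if start > 0 and (idx - start) * frame_ms >= min_gap_ms:
                gaps.append((start * frame_ms / 1000, idx * frame_ms / 1000))
            start = None
    return gaps
```

On a recording like the Chinese sample, one would expect a gap between most adjacent syllables, while continuous English speech would yield few or none.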
Left: English sample. Right: Chinese sample.


In Chinese, the initial consonant part at the beginning of a syllable is referred to as the “initial” (声母).

In this section, I use two Chinese audio clips to analyze the initials in Chinese. The first is the Chinese translation of “University of Waterloo”, “huá tiě lú dà xué”. The second is the Chinese digits counted from 1 to 10: yī èr sān sì wǔ liù qī bā jiǔ shí.

Chinese digit audio, 1 – 10

I start with scenario 1, where the initial is followed by only one vowel. From the first sample, I use ‘lú’ and ‘dà’; from the second sample, ‘sān’ and ‘shí’.

From these four examples, it can be seen that “l” and “d” are short consonants, while “s” and “sh” are long consonants. The spectrum of each consonant is also different.





We can see that the energy of the “sh” sound is spread over a relatively broad range and is concentrated in the higher frequencies, between 2000 and 8000 Hz. Moreover, its high-frequency energy does not attenuate significantly. “sh” also has more energy at lower frequencies than “s”, whereas “s” has slightly more energy at the very highest frequencies.
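One way to put numbers on such a comparison is to measure how much of a frame’s energy falls inside a frequency band, together with its spectral centroid. The sketch below is a plain-Python illustration using a direct DFT on a short Hann-windowed frame; the function names are hypothetical, and it is meant for short frames (a few hundred samples) rather than whole recordings.

```python
import cmath
import math


def power_spectrum(frame):
    """Power at each DFT bin from 0 Hz up to the Nyquist frequency."""
    n = len(frame)
    # Hann window to reduce leakage between neighbouring bins.
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    return [
        abs(sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]


def band_energy_fraction(frame, sample_rate, low_hz, high_hz):
    """Fraction of the frame's energy between low_hz and high_hz (inclusive)."""
    power = power_spectrum(frame)
    n = len(frame)
    total = sum(power)
    in_band = sum(
        p for k, p in enumerate(power) if low_hz <= k * sample_rate / n <= high_hz
    )
    return in_band / total if total > 0 else 0.0


def spectral_centroid(frame, sample_rate):
    """Energy-weighted mean frequency of the frame, in Hz."""
    power = power_spectrum(frame)
    n = len(frame)
    total = sum(power)
    weighted = sum(k * sample_rate / n * p for k, p in enumerate(power))
    return weighted / total if total > 0 else 0.0
```

Calling `band_energy_fraction(frame, sample_rate, 2000, 8000)` on a frame cut from “sh” and on one cut from “s” would give a direct numeric counterpart to what the two spectra show visually.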

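For the “d” sound, there is a sharp increase in energy over a short period. Compared to “t”, “d” also shows more low-frequency energy. Mandarin “d” is usually described as an unaspirated voiceless stop, in contrast to the aspirated “t”, so this low-frequency energy most likely comes from the vocal cord vibration of the following vowel, which begins almost immediately after the release, rather than from voicing during the consonant itself.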

The “l” sound is a voiced sound, so the vocal cord vibrations produce more low-frequency energy in the spectrum. At the same time, the energy roll-off in the high-frequency part of the “l” sound spectrum is not particularly pronounced.

The first green box is the consonant ‘l’, and the second green box is the consonant ‘d’

First box: ‘s’; second box: ‘sh’

Left: lú; right: dà

spectrogram of tiě, showing the transition from t -> ti -> iě

spectrogram of xué, showing the transition from x -> xu -> ué

Initial -- Final Transition

Sometimes the final that follows the initial contains more than one vowel. When there are two vowels, something interesting happens: the spectrogram of the syllable divides into three parts, which I explain with the following examples.
The first part contains only the consonant, and its spectral characteristics depend on how that consonant is articulated, including the place of articulation and the degree to which the airflow is obstructed.
The third part is primarily composed of vowels and sometimes includes a coda. The vowel region displays clear formant peaks on the spectrogram, which I will detail in subsequent sections. These formant peaks appear as the bright frequency bands of the vowel.
The second part is particularly intriguing because it shows the consonant and the vowel combining. On the spectrogram, this part differs from the other two. Compared to the pure consonant, the frequency density here is higher, yet it lacks the prominent bright bands of the pure vowel part. In this transition, the articulation moves from the consonant to the vowel, giving a smooth passage from the initial to the final.
spectrogram of huá, showing the transition from h -> hu -> uá




Final only

Sometimes a syllable has no initial. For example, among the Chinese digits, yī, èr, and wǔ have no initials. This shows in the spectrum, where there is only one energy distribution.

Spectrogram of yī (1), èr (2), and wǔ (5); note how they differ from the other sounds

frequency spectrogram, window size = 2048

EAC spectrogram, window size = 2048
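For readers who want to reproduce a frequency spectrogram like the one above outside Audacity, here is a minimal plain-Python sketch with the same window size of 2048. The Hann window and the hop of 512 samples are assumptions; the captions only state the window size. The radix-2 FFT requires the window size to be a power of two.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectrogram(samples, window_size=2048, hop=512):
    """List of magnitude spectra (0 Hz .. Nyquist), one per analysis frame."""
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * i / (window_size - 1))
        for i in range(window_size)
    ]
    frames = []
    for start in range(0, len(samples) - window_size + 1, hop):
        segment = [samples[start + i] * window[i] for i in range(window_size)]
        spectrum = fft(segment)
        frames.append([abs(c) for c in spectrum[: window_size // 2 + 1]])
    return frames


def peak_frequency(magnitudes, sample_rate, window_size=2048):
    """Frequency (Hz) of the strongest bin in one frame."""
    k = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return k * sample_rate / window_size
```

The window size sets the trade-off between frequency and time resolution: at an assumed sample rate of 44.1 kHz, a 2048-sample window gives bins about 21.5 Hz apart, but each frame spans roughly 46 ms of audio.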

In the Pinyin system, when the letter “i” serves as a final (vowel) without a preceding initial (consonant), a “y” is usually added to mark the beginning of the syllable. Similarly, when “u” is used as a final without a preceding initial, a “w” is added. When “ü” is a final without an initial, the two dots are usually removed and a “y” is added in the Pinyin spelling.

Source: https://web.mit.edu/jinzhang/www/pinyin/tones/index.html
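The spelling rule for syllables without an initial can be written as a small function. This is a sketch: the single-vowel cases (i, u, ü) are the ones described above, while the handling of compound finals (for example ia becoming ya) follows the standard Hanyu Pinyin spelling rules and goes slightly beyond what this article covers. The function name is hypothetical, and it expects toneless finals in their full form (iou, uei, uen rather than iu, ui, un).

```python
def spell_zero_initial(final):
    """Return the Pinyin spelling of a toneless final that has no initial."""
    if final in ("i", "in", "ing"):
        return "y" + final        # i -> yi, in -> yin, ing -> ying
    if final.startswith("i"):
        return "y" + final[1:]    # ia -> ya, iou -> you
    if final == "u":
        return "wu"               # u -> wu
    if final.startswith("u"):
        return "w" + final[1:]    # uo -> wo, uei -> wei
    if final.startswith("ü"):
        return "yu" + final[1:]   # ü -> yu, üe -> yue (dots dropped)
    return final                  # a, e, er, ... are written unchanged
```

This is why the digits 1 and 5 are written yī and wǔ even though, as the spectrograms show, they are pronounced without an initial consonant.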

Finals -- Tones

In Chinese, finals are called “韵母” (yùnmǔ), where “韵” (yùn) means “rhyme”. The tone of a syllable is carried by its final, so in this section I focus on how tones vary within the finals. Before that, I introduce some basic theory: the definition of the “five-level tone marks”.
Here is the explanation of each tone:
First Tone (High Level Tone): Marked with a macron (¯) above the vowel, indicating a high and steady pitch level.
Second Tone (Rising Tone): Marked with an acute accent (´) above the vowel, indicating the pitch rises from a medium to a high level.
Third Tone (Falling-Rising Tone): Marked with a caron (ˇ) above the vowel, indicating the pitch first falls then rises. Note that in actual speech, the third tone often manifests as a low pitch, with the rising part frequently omitted.
Fourth Tone (Falling Tone): Marked with a grave accent (`) above the vowel, indicating the pitch falls sharply from high to low.
Neutral Tone (also known as the Fifth Tone or Light Tone): Not marked with any specific tone mark, but its pronunciation is lighter and shorter, with the pitch varying depending on the preceding tone.
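Because each tone has its own diacritic, the tone number can be read directly from a Pinyin syllable. The sketch below decomposes the syllable with Python’s standard `unicodedata` module and looks for the combining mark. The `CHAO_CONTOUR` table gives the pitch contours commonly cited for each tone on the five-level scale; it is included as background and does not come from the measurements in this article.

```python
import unicodedata

# Combining diacritics: macron, acute accent, caron, grave accent.
TONE_BY_MARK = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}

# Commonly cited contours on the five-level scale (1 = lowest, 5 = highest).
CHAO_CONTOUR = {1: "55", 2: "35", 3: "214", 4: "51"}


def tone_number(syllable):
    """Return 1-4 for a marked syllable, or 5 for the unmarked neutral tone."""
    # NFD splits a letter such as "á" into "a" plus a combining accent.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_BY_MARK:
            return TONE_BY_MARK[ch]
    return 5


def tones(phrase):
    """Tone numbers for a space-separated Pinyin phrase."""
    return [tone_number(s) for s in phrase.split()]
```

For the phrase analyzed in this article, `tones("huá tiě lú dà xué")` returns `[2, 3, 2, 4, 2]`.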
Here, I again use “huá tiě lú dà xué” to analyze the second, third, and fourth tones.
The first final is uá. The spectrogram clearly slopes upward.
The second final is iě, where the spectrogram shows a slight decrease.
The third final is ú. As the spectrogram shows, the pitch rises at the end.
The fourth final is à, which shows a decrease in frequency.
The fifth final, ué, is similar to the first: the pitch rises over the final.

For the first tone, sān is an example. The EAC diagram clearly shows the pitch stays the same (and high) for this sound.
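The visual judgments above (rising, falling, level) can be approximated from a sequence of pitch values. The following Python sketch is a heuristic, not a validated tone classifier: it compares the average pitch in the first, middle, and last thirds of the contour, and the one-semitone tolerance is an assumption.

```python
import math


def classify_contour(f0_hz, tolerance_semitones=1.0):
    """Label a pitch track (Hz, at least 3 values) by its overall shape."""
    # Work in semitones relative to the first value, so the same
    # thresholds apply to low and high voices.
    st = [12 * math.log2(f / f0_hz[0]) for f in f0_hz]
    third = len(st) // 3
    start = sum(st[:third]) / third
    middle_part = st[third:len(st) - third]
    middle = sum(middle_part) / len(middle_part)
    end = sum(st[-third:]) / third
    tol = tolerance_semitones
    if middle < start - tol and middle < end - tol:
        return "falling-rising"   # shape of a fully pronounced third tone
    if end - start > tol:
        return "rising"           # shape of the second tone
    if start - end > tol:
        return "falling"          # shape of the fourth tone
    return "level"                # shape of the first tone
```

As noted in the tone list, the third tone is often realized as a low pitch without the final rise, so a heuristic like this may label it “falling” or “level” in connected speech, which is consistent with the slight decrease seen for iě.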



Chinese phonetics also has the concept of the “coda”, or “final sound” (尾音), although the term is used mainly in phonology and linguistics. A Chinese syllable is generally divided into the initial (声母), the final (韵母), and the tone, which is carried by the final. The final can be further broken down into a vowel part and an ending part, and the ending part is the coda. Codas are especially evident in Chinese dialects that retain features of ancient Chinese pronunciation, such as Cantonese and Min Nan, which may include nasal codas (such as /m/, /n/, /ŋ/) and stop codas (such as /p/, /t/, /k/).

Mandarin Chinese keeps only ‘n’ and ‘ng’. In the spectrum below, you can see a small ‘tail’ after the vowel, and the EAC shows little fundamental frequency. I have already shown the sān example, so here I show ‘shēng’. ‘shēng’ has many meanings in Chinese; one of them is ‘声’ (sound), which is the topic of this course (audio)!
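To tie the three components together, here is a small Python sketch that splits a Pinyin syllable into its initial, final, and coda. The list of initials is the standard set of 21 Mandarin initials; following this article, only ‘n’ and ‘ng’ are treated as codas. The function names are hypothetical, and the letters y and w are left inside the final, since they are spelling conventions rather than initials.

```python
import unicodedata

# Two-letter initials come first so that "sh" is matched before "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tone(syllable):
    """Remove the tone diacritic but keep the dots on ü."""
    decomposed = unicodedata.normalize("NFD", syllable)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def split_syllable(syllable):
    """Return (initial, final, coda); the coda is the ending of the final."""
    plain = strip_tone(syllable.lower())
    initial = next((i for i in INITIALS if plain.startswith(i)), "")
    final = plain[len(initial):]
    if final.endswith("ng"):
        coda = "ng"
    elif final.endswith("n"):
        coda = "n"
    else:
        coda = ""
    return initial, final, coda
```

For example, `split_syllable("shēng")` returns `("sh", "eng", "ng")`, and `split_syllable("dà")` returns `("d", "a", "")`.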



Mandarin Chinese is known for its phonetic regularity and relative simplicity, especially in its tonal and syllabic structure. Of course, mastering the pronunciation of Chinese is just the first step in exploring Chinese culture. With one of the few logographic writing systems still in wide use and an agricultural civilization passed down over five millennia, Chinese culture is remarkably rich. By learning Mandarin and Chinese characters, one can gain a deeper understanding of the nuances and essence of Chinese culture and appreciate its wisdom and aesthetic value. The process is challenging, but also immensely rewarding.