Research on the Chinese language is becoming an important theme in psycholinguistics. Not only is Chinese one of the most widely spoken languages in the world, it also differs in interesting ways from the alphabetic writing systems used in the Western world. For example, the logographic writing system makes it impossible to compute a word's phonology on the basis of non-lexical letter-to-sound conversions. Another characteristic of the Chinese writing system is that there are no spaces between the words. This is likely to have consequences for eye movement control in reading. Finally, a Chinese character represents a syllable, which most of the time is a morpheme (i.e., the smallest meaningful element), and many Chinese words are in fact disyllabic compound words.

Research on the Chinese language requires reliable information about word characteristics, so that the stimulus materials can be manipulated and controlled properly. By far the most important word feature is word frequency. In this text, we first describe the frequency measures that are available for Chinese. Then, we describe the contribution a new frequency measure based on film subtitles is making in other languages and we present a similar database for Mandarin Chinese.

Available sources of Chinese word frequencies

A first way to find information about Chinese word frequencies is to look them up in published frequency-based dictionaries. The source most frequently used thus far has been the Dictionary of Modern Chinese Frequency (1986), which is based on a corpus of 1.8 million characters (or 1.3 million words after segmentation) and provides frequency information for 31,159 words. Although this dictionary has been very useful, it is becoming increasingly outdated, as it is based on publications from the 1940s to the 1970s. A further limitation is the rather small size of the underlying corpus. Another dictionary that can be used is the Frequency Dictionary of Modern Chinese words in common uses (1990). This dictionary is based on a corpus of 25 million characters, but unfortunately only provides information about the 10,000 most frequent words, making it less suited for low-frequency items. Most other frequency-based dictionaries contain even fewer words. For instance, the recently published A Frequency Dictionary of Mandarin Chinese: Core Vocabulary for Learners only contains information about the 5,000 most frequently used words.

A second source of word frequency information consists of frequency lists that have been compiled by linguists and official organizations ( for an earlier review). Most of these lists are not publicly available, but can be obtained from the researchers. In Table 1 we summarize the most interesting lists we have encountered in our search. When reading Table 1, it is important to keep in mind that many corpora were meant to be representative of the language produced in Chinese-speaking regions and not necessarily of the language heard and read daily by Chinese-speaking people. In addition, some of these sources are copyright protected. One main problem with Chinese word frequencies is that Chinese words are not written separately, which makes segmenting the corpus into words labor-intensive if one wants information beyond single-character frequencies (Chinese words can consist of one to four or even more characters). This situation is currently changing, due to the availability of automatic parsers and part-of-speech taggers, as we will see below.

All in all, despite the existence of several frequency lists in Chinese, there are only three sources that provide easy access for individual researchers and other people interested in the Chinese language.
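
The difference between character frequencies and word frequencies mentioned in this text can be illustrated with a minimal sketch. It is not the method used for any of the dictionaries or lists discussed here. It assumes a tiny hypothetical lexicon (`LEXICON`), a two-sentence toy corpus, and greedy forward maximum matching as a stand-in for a real automatic segmenter. Character counts need no segmentation at all, whereas word counts depend entirely on where the word boundaries are placed.

```python
from collections import Counter

# Hypothetical toy lexicon, for illustration only. A real study would rely on
# a full dictionary or a trained segmenter rather than a handful of entries.
LEXICON = {"我们", "我", "喜欢", "看", "电影", "中国"}
MAX_WORD_LEN = 4  # assumption: candidate words of up to four characters


def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Greedy forward maximum matching.

    At each position, take the longest substring found in the lexicon;
    fall back to a single character when nothing longer matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy corpus written without spaces, as Chinese text normally is.
corpus = ["我们喜欢看电影", "我喜欢中国电影"]

# Character frequencies: no segmentation required.
char_counts = Counter(ch for line in corpus for ch in line)

# Word frequencies: only available after segmentation.
word_counts = Counter(word for line in corpus for word in segment(line))

print(segment(corpus[0]))
print(char_counts.most_common(3))
print(word_counts.most_common(3))
```

Greedy matching is chosen here only because it is short enough to read in full. It resolves ambiguous character sequences poorly, which is one reason manual segmentation was needed before automatic parsers became available.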