What is a corpus?
A corpus (plural: corpora) is commonly defined as a collection of naturally occurring language that is assembled to be representative of some language variety. Slightly simplified, a (prototypical) corpus is a large collection of texts.
Corpora are frequently used by linguists to study language use, but what many people don’t know is that they can also be very helpful for anyone interested in improving their English.
Problems that corpora can help you solve
Let’s say, for example, that you
…have been told that you need to use more linking words or connectives (e.g. however, nevertheless and thus) in your text to make it easier for the reader to follow, but you are not sure how and where they are typically used.
…are not sure whether the word big is too informal for academic writing.
…do not know which word to use instead of big in an academic text.
Good news: corpora can be used to help you with all these problems – and many more!
Free online corpora
While there are many useful corpora online that are free to use, I will focus here on one of the most user-friendly ones for the English language, namely the Corpus of Contemporary American English (COCA) (Davies, 2008). The corpus interface can be found here: https://corpus.byu.edu/coca/ .
In order to use it, you have to create a free account, which is done by clicking on the yellow person icon in the upper-right corner. Once you have signed up, you can log in and start using the corpus. Your account also gives you access to the other BYU corpora (which can be found here: https://corpus.byu.edu/ ).
We will now look at how to solve the problems described above.
How to use corpora: some examples
Example 1. Words in context: how is nevertheless used in academic writing?
If you are not sure how and where a word such as nevertheless is used, you can search for it in COCA and get several hundred, sometimes even thousands, of examples. To search for a word, simply type “nevertheless” into the search window (under the SEARCH tab) and click on “Find matching strings”. You will then be taken to the FREQUENCY tab, where you will see that this word occurs more than 15,000 times in the corpus. If you click on the word, you will be taken to the CONTEXT tab, where you can see all the instances of the word in context.
This is what it looks like:
The first column lists the example number, the second column tells you which year the example is from, the third tells you which genre it is from (ACAD = Academic writing), and the fourth gives you more specific information about where it is from (in this case: which field the example was used in). To see the full example, click on any of these four columns.
From these examples, it is clear that while nevertheless is often used sentence-initially (followed by a comma), this linking word can also be placed elsewhere in the sentence.
In order to get a more varied sample (i.e. from many different fields or genres), you can click on “100”, “200”, “500” or “1,000” in the upper-left corner to get a random sample of that size. If you are only interested in examples from Academic writing (and thus not from any of the other genres, Magazine, Newspaper, Fiction and Spoken), you can specify this when you do the search, by clicking on “Sections” under “Find matching strings” and choosing which section(s) to search in; more about this below.
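For readers who like to see the idea behind the CONTEXT tab spelled out, the following is a minimal Python sketch of a keyword-in-context (KWIC) display. It is not how COCA works internally and it does not access COCA; the two sample sentences are invented for illustration, and the function names kwic and count_sentence_initial are hypothetical.

```python
import re


def _word_pattern(keyword):
    # Match the keyword as a whole word, ignoring case.
    return re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`."""
    lines = []
    for match in _word_pattern(keyword).finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines


def count_sentence_initial(text, keyword):
    """Return (sentence-initial occurrences, all occurrences) of `keyword`."""
    initial = total = 0
    for match in _word_pattern(keyword).finditer(text):
        total += 1
        before = text[:match.start()].rstrip()
        # Treat the start of the text or a preceding ., ! or ? as a sentence start.
        if before == "" or before[-1] in ".!?":
            initial += 1
    return initial, total


# Invented sample text (NOT taken from COCA), for illustration only.
sample = (
    "The sample was small. Nevertheless, the results were consistent. "
    "The method is costly; it is nevertheless widely used in the field."
)

for line in kwic(sample, "nevertheless"):
    print(line)

print(count_sentence_initial(sample, "nevertheless"))
```

Lining up the search word in the middle of each line is what makes it easy to see at a glance what typically comes before and after it.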
Example 2. Register/style: might big be considered informal?
For the previous example, we used COCA and its search interface in its simplest form, namely to provide examples. We will now use the corpus interface to do some slightly more advanced searches to investigate whether the adjective big is a suitable word for an academic text.
To do so, we will make use of the fact that the texts in COCA come from five different genres: Spoken, Fiction, Magazine, Newspaper and Academic. This enables us to compare the use of words or expressions across different genres. Based on the assumptions that speech tends to be less formal than writing, and that academic writing tends to be more formal than other written genres, we can use the texts in the Spoken subset to represent informal language, and the texts in the Academic subset to represent formal writing.
Thus, if a word or an expression is common in the Spoken subset, but uncommon in the Academic subset, we can draw the (tentative) conclusion that this word or expression is a bit too informal for academic writing.
To search for big across the different genres, click on “Sections” and then tick the box next to “Sections” under the search window. Here, we can specify which genres to compare, or simply mark “Ignore” to compare all five genres.
When we click on “Find matching strings”, we will see that while big does occur in academic writing, the word is much more strongly associated with speech, which suggests that it might be considered slightly informal.
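Because the genre subsets need not be the same size, a fair comparison relies on frequencies per million words rather than raw counts. The following minimal Python sketch shows that calculation. All counts and section sizes below are hypothetical numbers chosen for illustration, not actual COCA figures, and the function name per_million is an assumption of this sketch.

```python
def per_million(count, section_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / section_size


# HYPOTHETICAL raw counts of a word and section sizes in words.
# These are NOT real COCA figures.
counts = {"Spoken": 60_000, "Academic": 10_000}
sizes = {"Spoken": 120_000_000, "Academic": 100_000_000}

normalised = {genre: per_million(counts[genre], sizes[genre]) for genre in counts}
ratio = normalised["Spoken"] / normalised["Academic"]

for genre, freq in normalised.items():
    print(f"{genre}: {freq:.1f} per million words")
print(f"Spoken-to-Academic ratio: {ratio:.1f}")
```

A ratio well above 1 would point in the same direction as the tentative conclusion above: the word is more at home in speech than in academic writing.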
If we want to find another word to use instead, we can use the search expression “[=big]” to ask COCA to give us synonyms for big. If we leave the box next to “Sections” ticked, we will be given synonyms and their frequencies across the different genres. If we click “Find matching strings”, we get a long list of possible words to use instead.
However, as the word big is polysemous (i.e. it has many different meanings), we have to be a bit careful about which word we choose. For example, among the synonyms listed, older and adult are not really the kind of words that we are looking for.
A useful tip is to look for synonyms that are frequent in academic writing and infrequent in the other genres. In this case, adjectives such as substantial, extensive and considerable seem to be good candidates if we are looking for more formal synonyms for big. A word of caution, however: to make sure that the word we choose can be used the way we want to use it in our text, we have to click on the word and look at how it is used in context.
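The tip about picking synonyms that are frequent in academic writing but infrequent elsewhere can be expressed as a simple filter. The Python sketch below assumes we have copied per-million frequencies for a few candidate words into a table by hand. The numbers are hypothetical, not real COCA output, and the 0.5 threshold and the name academic_share are arbitrary choices made for this illustration.

```python
# HYPOTHETICAL per-million frequencies by genre (NOT real COCA figures).
candidates = {
    "substantial": {"Spoken": 10, "Fiction": 5, "Magazine": 25, "Newspaper": 30, "Academic": 130},
    "considerable": {"Spoken": 15, "Fiction": 10, "Magazine": 20, "Newspaper": 15, "Academic": 90},
    "huge": {"Spoken": 90, "Fiction": 40, "Magazine": 70, "Newspaper": 60, "Academic": 20},
}


def academic_share(freqs):
    """Share of a word's summed per-million frequency that falls in Academic."""
    return freqs["Academic"] / sum(freqs.values())


# Keep words whose use is concentrated in academic writing, most academic first.
threshold = 0.5  # arbitrary cut-off for this sketch
formal = sorted(
    (word for word, freqs in candidates.items() if academic_share(freqs) >= threshold),
    key=lambda word: academic_share(candidates[word]),
    reverse=True,
)
print(formal)
```

A filter like this only narrows down the list. It says nothing about meaning, so checking each candidate in context remains necessary.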
While I have given you a very brief introduction to corpora through two examples, there are many, many more ways in which you can use them to improve your academic English. For example, corpora are also very useful for figuring out which preposition to use (e.g. to know which preposition should be used with advantage in the advantage __ this method) or for knowing which words go together (e.g. does one typically wholly understand or completely understand something?).
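The preposition question above boils down to counting which word most often follows advantage. As a rough sketch of that idea, the following Python snippet counts the word immediately after a target word in a small text. The three sentences are invented for illustration and are not COCA data, and next_word_counts is a hypothetical helper name.

```python
import re
from collections import Counter


def next_word_counts(text, target):
    """Count the words that immediately follow `target` in `text`."""
    pattern = re.compile(r"\b" + re.escape(target) + r"\s+(\w+)", re.IGNORECASE)
    return Counter(match.group(1).lower() for match in pattern.finditer(text))


# Invented sample sentences (NOT taken from COCA), for illustration only.
sample = (
    "The main advantage of this method is speed. "
    "One advantage of the approach is its simplicity. "
    "This gives the team an advantage over its rivals."
)

counts = next_word_counts(sample, "advantage")
print(counts.most_common())
```

With a real corpus, the same kind of count is based on thousands of examples, which is what makes the most frequent pattern a reliable guide.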
If you are interested in finding out more about how to use corpora, there are many online tutorials that can help you further explore the usefulness of corpora.
References
Davies, M. (2008). The Corpus of Contemporary American English (COCA): 520 million words, 1990-present.
McEnery, T., Xiao, R. & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge.
About the author
Tove Larsson has a PhD in English linguistics. In her research, she uses corpus linguistics methods to study the interface between lexis and grammar in academic writing. She currently works for the Language Workshop and the Unit for Professional English here at Uppsala University.