17 Dec 10
Description: What's all this do?
When you enter phrases into the Google Books Ngram Viewer, it displays a graph showing how often those phrases have occurred in a corpus of books (e.g., "British English", "English Fiction", "French") over the selected years. Let's look at a sample graph:
This shows trends in three ngrams from 1950 to 2000: "nursery school" (a 2-gram or bigram), "kindergarten" (a 1-gram or unigram), and "child care" (another bigram). What the y-axis shows is this: of all the bigrams contained in our sample of books written in English and published in the United States, what percentage of them are "nursery school" or "child care"? Of all the unigrams, what percentage of them are "kindergarten"? Here, you can see that use of the phrase "child care" started to rise in the late 1960s, overtaking "nursery school" around 1970 and then "kindergarten" around 1973. It peaked shortly after 1990 and has been falling steadily since.
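To make the y-axis concrete, here is a minimal Python sketch of the percentage described above. The function name and the counts are invented for illustration; they are not taken from the actual dataset or from the Viewer's own code.

```python
def ngram_percentage(match_count, total_count):
    """Return match_count as a percentage of all same-length ngrams in a year."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return 100.0 * match_count / total_count


# Hypothetical counts for a single year (not real data).
total_bigrams_in_year = 2_000_000
child_care_bigrams = 50

share = ngram_percentage(child_care_bigrams, total_bigrams_in_year)
print(f"'child care' share of all bigrams: {share}%")
```

Note that a bigram is compared only against other bigrams, and a unigram only against other unigrams, so each line on the graph has its own denominator.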
(Interestingly, the results are noticeably different when the corpus is switched to British English.)
Below are descriptions of the corpora that can be searched with the Google Books Ngram Viewer. All of these corpora were generated in July 2009; we will update these corpora as our book scanning continues, and the updated versions will have distinct persistent identifiers.
Informal corpus name | Persistent identifier | Description
American English | googlebooks-eng-us-all-20090715 | Same filtering as the English corpus but further restricted to books published in the United States.
British English | googlebooks-eng-gb-all-20090715 | Same filtering as the English corpus but further restricted to books published in Great Britain.
Chinese (simplified) | googlebooks-chi-sim-all-20090715 | Books predominantly in simplified Chinese script.
English | googlebooks-eng-all-20090715 | Similar to Google Million, but not filtered by subject and with no per-year caps.
English Fiction | googlebooks-eng-fiction-all-20090715 | Same filtering as the English corpus but further restricted to fiction books.
English One Million | googlebooks-eng-1M-20090715 | The "Google Million". All are in English with dates ranging from 1500 to 2008. No more than about 6000 books were chosen from any one year, which means that all of the scanned books from early years are present, and books from later years are randomly sampled. The random samplings reflect the subject distributions for the year (so there are more computer books in 2000 than 1980). Books with low OCR quality were removed, and serials were removed.
French | googlebooks-fre-all-20090715 | Books predominantly in the French language.
German | googlebooks-ger-all-20090715 | Books predominantly in the German language.
Spanish | googlebooks-spa-all-20090715 | Books predominantly in the Spanish language.
Russian | googlebooks-rus-all-20090715 | Books predominantly in the Russian language.
Searching inside Google Books
Below the graph, we show "interesting" year ranges for your query terms. Clicking on those will submit your query directly to Google Books. Note that the Ngram Viewer is case-sensitive, but Google Books search results are not.
Those searches will yield phrases in the language of whichever corpus you selected, but the results are returned from the full Google Books corpus. So if you use the Ngram Viewer to search for a French phrase in the French corpus and then click through to Google Books, that search will be for the same French phrase -- which might occur in a book predominantly in another language.
But but but...
What about punctuation?
Full details of how we deal with punctuation can be found in the Science paper, but here are two of the more important rules:
Punctuation marks at the ends of tokens become tokens themselves. You can search for a plain period in the Ngram Viewer, and "Why?" becomes a bigram: "Why" and "?".
When a hyphen occurred at the end of a line, it was removed and the two fragments joined together into a unigram.
An example from the Science paper:
I'm seeing the man with the telescope.
This yields the following bigrams:
However, we've special-cased apostrophes so that users can keep them inside words: "can't" and "won't" will return the expected results.
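The rules above can be approximated with a short Python sketch. This is a simplified illustration, not the tokenizer actually used to build the corpora: the set of trailing punctuation characters, the function names, and the handling of apostrophes (simply left inside the word) are all assumptions, and the full rules are in the Science paper.

```python
import re

# Characters treated as trailing punctuation in this sketch (an assumption).
TRAILING_PUNCT = ".,;:!?"


def join_line_end_hyphens(text):
    """Rejoin a word that was split by a hyphen at the end of a line."""
    return re.sub(r"(\w)-\n\s*(\w)", r"\1\2", text)


def tokenize(text):
    """Split on whitespace, then peel trailing punctuation off as its own tokens."""
    tokens = []
    for raw in join_line_end_hyphens(text).split():
        trailing = []
        # Keep at least one character so a lone "." stays a token.
        while len(raw) > 1 and raw[-1] in TRAILING_PUNCT:
            trailing.append(raw[-1])
            raw = raw[:-1]
        tokens.append(raw)
        tokens.extend(reversed(trailing))
    return tokens


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(tokenize("Why?"))             # two tokens: "Why" and "?"
print(tokenize("tele-\nscope"))     # one token: the rejoined word
print(tokenize("can't"))            # apostrophe stays inside the word
print(ngrams(tokenize("Why?"), 2))  # a single bigram
```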
Why do I see spikes and plateaus in early years?
Publishing was a relatively rare event in the 16th and 17th centuries. (Only about 500,000 books were published in English before the 19th century.) So if a phrase occurs in one book in one year but not in the preceding or following years, that creates a taller spike than it would in later years.
Plateaus are usually simply smoothed spikes. Change the smoothing to 0 to see the underlying spikes.
What does "smoothing" mean?
Often trends become more apparent when data is viewed as a moving average. A smoothing of 1 means that the data shown for 1950 will be an average of the raw count for 1950 plus 1 value on either side: ("count for 1949" + "count for 1950" + "count for 1951"), divided by 3. So a smoothing of 10 means that 21 values will be averaged: 10 on either side, plus the target value in the center of them.
At the left and right edges of the graph, fewer values are averaged. With a smoothing of 3, the leftmost value (pretend it's the year 1950) will be calculated as ("count for 1950" + "count for 1951" + "count for 1952" + "count for 1953"), divided by 4.
A smoothing of 0 means no smoothing at all: just raw data.
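The moving average described here, including the shorter windows at the edges, can be sketched in a few lines of Python. The function name and the sample values are made up for illustration; this is one straightforward reading of the description, not the Viewer's own code.

```python
def smooth(values, smoothing):
    """Centered moving average; windows are truncated at the edges of the data."""
    if smoothing < 0:
        raise ValueError("smoothing must be non-negative")
    result = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)                # clip at the left edge
        hi = min(len(values), i + smoothing + 1)  # clip at the right edge
        window = values[lo:hi]
        result.append(sum(window) / len(window))
    return result


# Hypothetical yearly values for 1950 through 1954 (not real data).
yearly = [4, 8, 0, 4, 9]

print(smooth(yearly, 0))  # no smoothing: the raw values
print(smooth(yearly, 1))  # interior points average 3 values, the edges only 2
print(smooth(yearly, 3))  # the first point averages 4 values: itself plus 3 to the right
```

With a smoothing of 3, the first entry is (4 + 8 + 0 + 4) / 4, which matches the edge rule above: the window simply drops the years that fall outside the graph.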
Many more books are published in modern years. Doesn't this skew the results?
It would if we didn't normalize by the number of books published in each year.
Why are you showing a 0% flatline when I know the phrase in my query occurred in at least one book?
We only consider ngrams that occur in at least 40 books. Otherwise the dataset would balloon in size and we wouldn't be able to offer them all.
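As a rough sketch of what such a cutoff looks like, the Python below drops every ngram seen in fewer than 40 books. The function name, the dictionary layout, and the counts are hypothetical, and the sketch assumes the threshold is applied to a single per-ngram book count.

```python
MIN_BOOKS = 40  # threshold stated above


def filter_rare_ngrams(book_counts, min_books=MIN_BOOKS):
    """Keep only the ngrams that occur in at least min_books books."""
    return {ngram: n for ngram, n in book_counts.items() if n >= min_books}


# Hypothetical book counts (not real data).
book_counts = {"child care": 12000, "nursery school": 9500, "zyxwv quux": 3}

kept = filter_rare_ngrams(book_counts)
print(sorted(kept))
```

An ngram removed this way has no data at all, which is why the graph shows a flat 0% line for it rather than a tiny value.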
Why does the word "Internet" occur before 1950?
Time traveling software engineers!
Most of those are OCR errors; we do a good job of filtering out books with low OCR quality scores, but some errors do slip through.
(One old usage of the word "Internet" is legitimate. Can you find it?)