
Words of Wisdom, Words of Strife, Words that Write the Book I Like

December 17, 2010

That didn’t take long.  In the Times this morning, only weeks after a study of the prevalence of words in the titles of British books from the 19th century, an article about how

Google has made a mammoth database culled from nearly 5.2 million digitized books available to the public for free downloads and online searches, opening a new landscape of possibilities for research and education in the humanities…It consists of the 500 billion words contained in books published between 1500 and 2008 in English, French, Spanish, German, Chinese and Russian.

I’d expressed my concern with that other study’s focus on single words as too “direct,” especially when dealing with a euphemistic culture like 19th century England.  In this new database, anyone with a computer can enter a phrase of up to five words to see its popularity over time, and even link to the Google Books available in different slices of time.

Perhaps because of the potential for irony in asking people to pay for the results of so much free data, the first research paper on this data has been “unpaywalled” at the journal Science (registration required nevertheless), subscriptions to which run $75 per year and up.  Hoping perhaps to ride the popularity of “Freakonomics,” one of the researchers has labeled their work “culturomics.”  They’ve definitely done some interesting work with the data.  In this example (cleaned up to remove all the citations), they offer tangible proof of something we all “knew” but which the data now bear out.

Suppression – of a person, or an idea – leaves quantifiable fingerprints. For instance, Nazi censorship of the Jewish artist Marc Chagall is evident by comparing the frequency of “Marc Chagall” in English and in German books. In both languages, there is a rapid ascent starting in the late 1910s (when Chagall was in his early 30s). In English, the ascent continues. But in German, the artist’s popularity decreases, reaching a nadir from 1936-1944, when his full name appears only once. (In contrast, from 1946-1954, “Marc Chagall” appears nearly 100 times in the German corpus.)  Such examples are found in many countries, including Russia (e.g. Trotsky), China (Tiananmen Square) and the US (the Hollywood Ten, blacklisted in 1947).
We probed the impact of censorship on a person’s cultural influence in Nazi Germany. Led by such figures as the librarian Wolfgang Hermann, the Nazis created lists of authors and artists whose “undesirable”, “degenerate” work was banned from libraries and museums and publicly burned. We plotted median usage in German for five such lists: artists (100 names), as well as writers of Literature (147), Politics (117), History (53), and Philosophy (35). We also included a collection of Nazi party members [547 names]. The five suppressed groups exhibited a decline. This decline was modest for writers of history (9%) and literature (27%), but pronounced in politics (60%), philosophy (76%), and art (56%). The only group whose signal increased during the Third Reich was the Nazi party members [a 500% increase].

Given such strong signals, we tested whether one could identify victims of Nazi repression de novo. We computed a “suppression index” for each person by dividing their frequency from 1933 – 1945 by the mean frequency in 1925- 1933 and in 1955-1965. In English, the distribution of suppression indices is tightly centered around unity. Fewer than 1% of individuals lie at the extremes.

In German, the distribution is much wider, and skewed leftward: suppression in Nazi Germany was not the exception, but the rule. At the far left, 9.8% of individuals showed strong suppression (s<1/5). This population is highly enriched for documented victims of repression, such as Pablo Picasso (s=0.12), the Bauhaus architect Walter Gropius (s=0.16), and Hermann Maas (s<.01), an influential Protestant Minister who helped many Jews flee.
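To make the “suppression index” concrete, here is a minimal Python sketch of the calculation as I read it from the passage above. It is not the researchers’ code. The function names are my own, and I am assuming that “frequency” means the average yearly frequency in each window and that the two reference windows are pooled into a single mean; the paper excerpt doesn’t spell out either detail.

```python
from statistics import mean

def window_mean(freq_by_year, years):
    """Average yearly frequency over the given years; missing years count as 0."""
    return mean(freq_by_year.get(year, 0.0) for year in years)

def suppression_index(freq_by_year):
    """Frequency in 1933-1945 divided by the mean frequency in the reference years.

    Assumption: the reference windows 1925-1933 and 1955-1965 are pooled.
    1933 falls in both the Nazi-era window and the first reference window,
    exactly as the date ranges are quoted in the paper excerpt.
    """
    during = window_mean(freq_by_year, range(1933, 1946))
    reference_years = list(range(1925, 1934)) + list(range(1955, 1966))
    reference = window_mean(freq_by_year, reference_years)
    if reference == 0:
        return None  # no baseline to compare against
    return during / reference

def strongly_suppressed(s):
    """The excerpt's threshold for strong suppression: s < 1/5."""
    return s is not None and s < 1 / 5

# Made-up series: a name that appears steadily, then almost vanishes in 1933-1945.
flat = {year: 5.0 for year in range(1925, 1966)}
dropped = {year: (1.0 if 1933 <= year <= 1945 else 10.0) for year in range(1925, 1966)}

flat_index = suppression_index(flat)        # steady usage, index near 1
dropped_index = suppression_index(dropped)  # usage collapses, index well under 1/5
```

An index near 1 means a name was mentioned about as often under the Nazis as before and after; the further below 1 it falls, the stronger the apparent suppression.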

This is an analytic method that could be used not only historically but in contemporary situations as well, by both government and non-government agencies, to see trends in culture and media in countries around the globe – “hard data” that could give the lie to government propaganda from nations that deny they are suppressing ideas, speech, or the people who promote them.

I don’t really understand the percentages returned by the tool, though – a search for “at,and,the” for instance says that the word “the” appears in 6% of the scanned English books.  I would assume that “the” would appear in every English book in the database.  “And” scores in 3% of the books, and “at” in only .4%.  Am I doing something wrong?  [Next day edit – or does 6% mean that uses of the word “the” comprise 6% of all the words in the data?]
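The two readings of that 6% are easy to tell apart on a toy example. Here is a small Python sketch with three made-up one-line “books”; the function names and the sample text are mine, not anything from Google’s tool. One function counts the share of books that contain a word at all, the other counts the word’s share of all the words in the corpus, which is the reading suggested in the next-day edit.

```python
from collections import Counter

def share_of_books(word, books):
    """Fraction of books that contain the word at least once."""
    return sum(word in book.lower().split() for book in books) / len(books)

def share_of_words(word, books):
    """Fraction of all words in the corpus that are this word."""
    tokens = [token for book in books for token in book.lower().split()]
    return Counter(tokens)[word] / len(tokens)

# Three invented "books", 20 words in total.
books = [
    "the cat sat at the door",
    "the dog and the cat ran",
    "a bird sang at the window and left",
]

the_in_books = share_of_books("the", books)  # every book contains "the"
the_in_words = share_of_words("the", books)  # 5 of the 20 words are "the"
```

Under the first reading “the” scores 100%, as I expected; under the second it scores a much smaller number, which is at least the same kind of figure the tool reports.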

All the same, even a simple search for “atheism,atheist” might give interesting results, though I can’t claim to tell if they’re “statistically significant.”  Run here from 1900-2008 (the cutoff date):


The references go down during WWI, spike during the Roaring 20s, decline again during the Depression, rise steadily during and after the horrors of WWII, spike during the 1960s, decline during the “new age” years of the 1970s, and (especially in America, see below) start trending steadily upward beginning in the “Moral Majority” years, increasing dramatically after 9/11, when Osama Bin Laden claimed that God wanted Americans dead and Jerry Falwell agreed.

So are those really the connections between the rise and fall of the topic?  It’s hard to say without context – either atheism, or the fear of it (“atheistic Roosevelt Socialists!”) could be the dominant concept in play.  Run the same searches on “British English”:


And “American English”:


The recent spike is far more dramatic in Britain, home to Dawkins and Hitchens and most all the “new atheists,” but you can see a steady rise in the US (probably not accidentally coinciding with the dawn of the “Moral Majority” and its continuing bid to establish an American theocracy). 

Again, I’m too innumerate to know what’s “statistically significant,” but then again this is a whole new data set in the world, and “digital humanities” is a new field, so maybe I’m not the only one who’s just figuring it out.  It’s all a thrilling addition to “liberal arts” as far as I’m concerned – human interpretation of human events isn’t being replaced, but rather augmented with not only new tools but whole new toolboxes.
