Tuesday, December 31, 2013

The End of Stupidity

That the links between web pages, rather than the words on the pages, are a good guide to quality (roughly, the more links that point to a page, the better it is) was, as we've discussed, the key insight that propelled Google into the limelight on the Web.  But this makes the Trivial Content objections all the more puzzling:  certainly, there are lots of quality pages on the Web, and the success of Google seems to confirm that, mostly, it finds them for us.  And it does, much of the time.  But the Achilles Heel of Google's ranking system--ordering the results of a search with the best first--lies in the same insight that made it so popular.

Popular.  That's the Achilles Heel.  Simply put, the results on the first page of a Google search are the ones everyone else on the Web thinks are "good."  But we don't know anything about all those other Web users, except that they liked the content (by linking to it) that we're now getting.  To make the point here, in the academic journal situation (which inspired PageRank, remember), we know lots about the authors of the articles.  We know for instance, that if Person A references Article Z, written by Person B, that both A and B are published authors in peer reviewed journals--they're experts.  Hence if we collect all the references to Article Z by simply counting up the experts, we've got a good idea of the value of Z to the community of scholars who care about whatever Z's about (Z's topic).  Since we're dealing with expert authors, counting them all up (recursively, but this is a detail) makes a ton of sense.
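The "recursively" detail is easier to see in a few lines of code.  What follows is only a minimal sketch of the idea, not Google's actual implementation:  the tiny citation graph, the function name rank_by_references, and the damping value of 0.85 are all assumptions made for illustration.  A reference from a highly ranked article counts for more than a reference from an obscure one, and the scores are recomputed until they settle.

```python
def rank_by_references(links, damping=0.85, iterations=50):
    """Simplified PageRank-style scoring (illustrative sketch only).

    links maps each article to the list of articles it references.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every article starts each round with a small baseline score.
        new_scores = {node: (1.0 - damping) / n for node in nodes}
        for source in nodes:
            targets = links.get(source, [])
            if targets:
                # An article splits its current score among what it references.
                share = damping * scores[source] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # An article that references nothing spreads its score evenly.
                share = damping * scores[source] / n
                for node in nodes:
                    new_scores[node] += share
        scores = new_scores
    return scores


# Hypothetical citation graph: A, B and C all reference Z; C also references A.
citations = {
    "A": ["Z"],
    "B": ["Z"],
    "C": ["A", "Z"],
    "Z": [],
}
scores = rank_by_references(citations)
best = max(scores, key=scores.get)
```

In this toy graph Z comes out on top, and A outranks B because A is itself referenced.  Notice that nothing in the calculation asks who A, B and C are:  the expertise is assumed, not measured.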

Skip to the Web, now, and the first thing that goes is "expert."  Who's to say why someone likes Web page Z?  Who's to say, if Person A likes Z, and Person B likes Z, and so on, that anyone is an expert about "Z" at all?  The Web case is different from the academic article case, then, because the users have no intrinsic connection to the content--they're not credentialed in any measurable way as experts or authorities on whatever Web page Z's about.  Lots of anonymous folks like Z; that's what we know.

This feature of Web ranking has a number of consequences.  One is that large, commercial sites tend to end up on the first page of Google results.  If I query "hiking boots", I'm likely to see lots of Web sites for big stores trying to sell me hiking boots, like REI, or Timberland, or what have you.  Of course, many Web users simply want big commercial web sites (and not, say, a blog about hiking boots, or an article about the history of hiking boots).  Most people using the Web want what most people linking things on the Web want--this is just to say that what's popular is by and large what most people want (a truism).  This is why Google works, and for the very same reason it's why it doesn't (when in fact it doesn't).

The next consequence is really a corollary of the Big Business consequence just noted.  We can call this the "Dusty Books" objection, because it's about content that is exactly what you want, but isn't exactly the most popular content.  This'll happen whenever you're looking for something that not a lot of people think about, or care about, or that for whatever reason isn't popular enough to get a high ranking.  It's a dusty book, in other words, like the book you find hidden away on a shelf of the library, last checked out three years ago, say, with dust on its cover from disuse.  Only, that's what you're looking for, it turns out.  You'll never see the dusty books in Google searches.  This is the point; if you think about how Google works for a second, it's an obvious point too.  Dusty books, by definition, aren't popular.  They're the Web pages that you want, but never find, and there are lots of them.  Think for another second about Google and you'll see the deeper problem, too:  the fact that it works so well most of the time for popular content means that some of the time it doesn't work at all.  All that popular, unwanted content is guaranteed to keep your dusty book hidden forever, back on the tenth or hundredth page of search results (and who looks at those?).  Google, in other words, gives us what we want whenever it's what everyone else wants too; if it's just what you want, all those other people on the Web are now your enemies.  They're hiding your dusty book from you.

But what could we want, that's not popular?  Oh, lots of things.  If I'm thinking of driving Highway 101 from Washington to California, say, I may want a big travel planner site telling me where the hotels are, or the camping grounds, or I may want a personal blog from someone who can write, who's actually driven the route, and can tell me all sorts of "expert" things that commercial Web sites don't bother with.  This fellow's blog may or may not be popular, or linked to a big travel site, so it's a crap shoot if I find it with Google (even if it's popular as a homegrown blog, it isn't popular compared to Trip Advisor).

Faced with this scenario, many people take to a blog search engine like Google Blog Search, or Technorati, or Ice Rocket (Google Blog Search is probably the best).  Only, the popularity-as-quality approach screws this up too, if you're looking for the expert opinion from the experienced traveler of 101 who writes a personal and informative blog.  Why?  Because the most linked-to stories about "Highway 101" are a litany of traffic accidents in local newspaper articles (somehow considered "blogs" by Google Blog Search).  For instance, the second result for the query "driving Highway 101" to Google Blog Search is: "Woman killed on Highway 101 near Shelton."  And lest we think this is a fluke, the third result is "Can toll lanes on Highway 101 help pay for Caltrain?", and the fourth is the helpful "Man Who Had Heart Attack in Highway 101 Crash Dies in Hospital."  Clearly, what's popular to Google Blog Search has little to do with what our user interested in driving 101 has in mind.  (Incidentally, the first result is three paragraphs from northwestopinions.com about the Christmas light show on 101 every year.  At least "northwestopinions.com" might be a find.)

What's going on here?  Well, you're getting what everyone links to, that's what.  The more interesting question is how we've all managed to be in the dark about the limitations of the approach that we use day in and day out.  Even more interesting:  exactly how do you find good blogs about driving Highway 101 (or hiking boots, or lamp shades, or whatever)?  Well, most people "Google around" still, and when they happen upon (in the search biz: "discover") an interesting site, or a portal site like Fodors or Trip Advisor, they save the URL or remember how to find it again.  Mostly, they just miss dusty books, though.

To continue with the Dusty Books metaphor, and to see the problem in a different way, imagine the public library organized according to popularity, rather than expertise on the topic, or authority (books that are published are ipso facto books with authority).  Someone wrote a definitive history of 101, or the guide to driving 101, but it's so detailed that most people don't bother to read it.  They get the lighter version, with the glossy cover.  Ergo, the definitive guide just disappeared from the library shelf.  It's not even a dusty, seldom-read book, it's simply not there anymore (this is akin to being on page 1323, say, of a Google search).  This is swell for all those 101 posers and dilettantes, but for you, you're really looking for the full, 570-page exposition on 101.  This is a ridiculous library, of course, because (we're all tempted to say, in chorus) what else is a library for, but to give you all the expert and authoritative books on a topic?  Who cares what's darned popular?  Indeed.  Returning then to the Web world, it's easy enough to see the limits of the content we're getting (and why, most of the time, we're all happy with it).  Put another way, the Web is skewed toward Trivial Content--every time what's popular trumps what's substantive, you get the popular.  (To be sure, when what's popular is also substantive--say, because "popular" expositions of Quantum Mechanics are those written by Scientific American writers, or MIT professors--there's no problem.)

But is this why Google is making us stupid?  Well, sort of, yes.  It's easier to see with something like "politics" or "economics", say.  If Web 2.0 liberated millions of people to write about politics, and Google simply delivers the most popular pages on this topic for us, then generally speaking all the "hard" discussions are going to fall off of the first page of a Google search.  "Popular politics" on the Web isn't William Jennings Bryan, it's usually a lot of surface buzz and griping and polarization.  Good versus evil.  Good guys, bad guys.  Doomsday predictions and everything else that crowds seize upon.  True, large media sites like the New York Times will pop up on the first page of a query about "health care crisis."  This is a consequence of popularity too (same reason that Trip Advisor shows up with hotel prices on your Highway 101 search).  But if you're looking for interesting, informed opinions out there in the public (say, from good bloggers or writers), you don't care about the NYT anyway.  Since Google doesn't care about the quality of an article, whatever has shock value is likely to be what you get for all the rest.  We might say here that, even if Google isn't actively making us stupid for Trivial Content reasons alone, if we're already uninformed (or "stupid"), it's not helping us get out of this situation by directing us to the most thoughtful, quality discussions.  It's up to us to keep looking around for them, full of hope, as it were.  (And, if we don't know what to look for, we're likely to think the Google results are the thoughtful ones, which explains why half my friends in the programming world are now conspiracy theorists, too.  Four years of learning to program a computer in "real" college, and their politics on the Web, and that's what you get.  Alas.)

To sum this up, then, the full answer to the question we began with ("is Google making us stupid?") is something like, yes.  While we didn't address all the reasons, we can blanket this with:  it's a Crappy Medium with Lots of Distractions that tends to encourage reading Trivial Content.  Mostly, then, it's not helping us become classically trained scholars, or better and more educated in the contemplative and thoughtful sense.  I've chosen to focus mostly on Trivial Content in this piece, because of the three, if you're staying on the Web (and most of us will, me included), improving the quality of search results seems the most amenable to change.  It takes, only, another revolution in search.  While it's outside the scope of this article to get into details (and like Popper once said, you can't predict innovation, because if you could, you'd already have innovated), a few remarks on the broad direction of this revolution are in order, by way of closing.

Search.next()

Google's insight, remember, was that the links between Web pages, and not only the words in pages, were good guides to quality.  It's interesting to note here that both the method Google replaced (the old Alta Vista search approaches that looked at correlations between words on a page and your query words) and its PageRank method rely on majority rules calculations.  In the old-style approach--what's called the "term frequency - inverse document frequency" or tf-idf calculation--the more frequently your query terms occur in the terms of the documents, the higher the rank they receive.  Hence, "majority rules" equals word frequency.  In the Google approach, as we've seen, "majority rules" equals link-to frequency.  In either case, the exceptions or minorities are always ignored.  This is why Google (or Alta Vista) has a tough time with low frequency situations like sarcasm:  if I write that "the weather here is great, as usual" and it's Seattle in December, most human readers recognize this as sarcasm.  But sarcasm isn't the norm, so mostly your query about great weather places in December will take you to Key West, or the Bahamas.  More to the point, if I'm looking for blogs about how the weather sucks in Seattle in December, the really good, insightful blog with the sarcasm may not show up.
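To make the word-frequency version of "majority rules" concrete, here is a rough sketch of a tf-idf style score.  It is an illustration under stated assumptions, not how Alta Vista (or anyone else) actually implemented it:  the tokenizer is crude, the three documents are made up for the example, and the "+ 1.0" smoothing on the idf term is just one common variant.

```python
import math
from collections import Counter

PUNCT = ".,:;!?\"'()"


def tokenize(text):
    """Crude tokenizer: split on whitespace, strip punctuation, lowercase."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def tf_idf_scores(query, documents):
    """Score each document against the query with a simple tf-idf sum."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            docs_with_term = sum(1 for t in tokenized if term in t)
            if docs_with_term == 0 or not tokens:
                continue
            tf = counts[term] / len(tokens)                # how often the term occurs here
            idf = math.log(n_docs / docs_with_term) + 1.0  # rarer terms weigh more (assumed variant)
            score += tf * idf
        scores.append(score)
    return scores


# Made-up documents for illustration.
docs = [
    "The weather here is great, as usual.",  # the sarcastic Seattle post
    "Great weather in Key West: great beaches, great sun, great weather all December.",
    "Toll lanes on the highway.",
]
scores = tf_idf_scores("great weather", docs)
```

The Key West page repeats the query words most often, so it scores highest; the sarcastic Seattle sentence contains the same words but comes second, and nothing in the arithmetic knows it means the opposite.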

So, interestingly, the Google revolution kept the same basic idea, which is roughly that converting human discourse or writing into computation involves looking for the most-of-the-time cases and putting them first.  Human language is trickier and more interesting and variegated than this approach allows, of course, which is the key to understanding what may be next in search.  Intrinsic quality is a property of the way a document is written.  Many computer scientists avoid this type of project, feeling it's too hard for computation, but in principle it's a syntactic property of language (and hence should be translatable into computer code).  Consider the following writing about, say, "famous writers who visited or lived in Big Sur, California."

Exhibit A
"I heard lots of really good writers go to Big Sur.  This makes sense to me, because the ocean is so peaceful and the mountains would give them peace to write.  Plus the weather is warm.  I can imagine sitting on the beach with a notepad and writing the next great novel at Big Sur.  And later my girlfriend and I would eat S'mores and build a fire.  My girlfriend likes to camp, but she doesn't hike very much.  So when I write she'd be at the camp maybe I don't know.  Anyway I should look up all the writers who went there because there must be something to it."


What's wrong with Exhibit A?  Nothing, really.  It's just, well, trivial.  It's Trivial Content.  But why?  Well, the author doesn't really say that much, and what he does say is general and vague.  He doesn't seem to know much about Big Sur, except that it's located near the ocean and it's forested, and other common pieces of knowledge like that you can camp and hike there.  He also doesn't seem to know many details (if any) about the writers who've spent time in Big Sur, or why they did.  In short, it's a vague piece of writing that demonstrates no real knowledge of the topic.  Enough of Exhibit A then.  

Exhibit B 

"BIG SUR, Calif. — The road to Big Sur is a narrow, winding one, with the Pacific Ocean on one side, spread out like blue glass, and a mountainside of redwood trees on the other.
The area spans 90 miles of the Central Coast, along Highway 1. Los Angeles is 300 miles south. San Francisco is 150 miles north. There are no train stations or airports nearby. Cell phone reception is limited. Gas and lodging are pricey."
"Venerated in books by late authors Henry Miller and Jack Kerouac, it's no wonder then that Big Sur continues to be a haven for writers, artists and musicians such as Alanis Morissette and the Red Hot Chili Peppers, all inspired by a hybrid landscape of mountains, beaches, birds and sea, plus bohemian inns and ultra-private homes."
"In the 1920s, American poet Robinson Jeffers meditated about Big Sur's "wine-hearted solitude, our mother the wilderness" in poems like "Bixby's Landing," about a stretch of land that became part of Highway 1 and the towering Bixby Bridge 13 miles south of Carmel. (Part of the highway near that bridge collapsed due to heavy rains this past spring, followed by a landslide nearby; the roadway reopened recently.)"
"Among literary figures, Miller probably has the strongest association with the area. "Big Sur has a climate all its own and a character all its own," he wrote in his 1957 autobiographical book "Big Sur and the Oranges of Hieronymus Bosch." "It is a region where extremes meet, a region where one is always conscious of weather, of space, of grandeur, and of eloquent silence."
Miller, famed for his explicit novel "Tropic of Cancer," lived and worked in Big Sur between 1944 and 1962, drawn to the stretch of coast's idyllic setting and a revolving cadre of creative, kind, hard-working residents."

What's better about Exhibit B?  Well, it's specific.  Qualitatively, the author (Solvej Schou, from the AP.  The full story appears in the Huffington Post) has specific facts about Big Sur and about the writers who've spent time there.  The paragraphs are full of details and discussion that would, presumably, be appreciated by anyone who queried about writers at Big Sur.  But quantitatively, or we should say here syntactically, the paragraphs are different from Exhibit A too.  Exhibit A is full of common nouns ("camp", "hike", "ocean", "writers") and it's relatively devoid of proper nouns that pick out specific places or people (or times, or dates).  Also, there are no links going out of Exhibit A--not links to Exhibit A, but links from Exhibit A--to other content, which would embed the writing in a broader context and serve as an external check on its content.  Syntactically, there's a "signature" in other words, that serves as a standard for judging Exhibit B superior to Exhibit A.  The key point here is "syntactic", because computers process syntax--the actual characters and words written--and so the differences between the two examples are not only semantic, and meaningful only to human minds.  In other words, there's a perfectly programmable, syntactic "check" on page quality, it seems, which is intrinsic to the Web page.  (Even in the case of the links we mentioned in Exhibit B, they're outbound links from the document, and hence are intrinsic to the document as well.)
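Just to show that such a check really is programmable, here is a rough sketch of what a syntactic "signature" might look like.  Everything in it is an assumption for illustration:  the function name quality_signature, the rule that a capitalized word in the middle of a sentence counts as a proper noun, and the choice to count numbers as a stand-in for dates.  A real system would need a proper part-of-speech tagger and a good deal more care.

```python
import re


def quality_signature(text):
    """Crude syntactic signature: proper nouns, numbers, outbound links."""
    outbound_links = len(re.findall(r"https?://\S+", text))
    # Remove URLs so they are not also counted as words.
    prose = re.sub(r"https?://\S+", " ", text)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", prose.strip()) if s]
    total_words = proper_nouns = numbers = 0
    for sentence in sentences:
        words = re.findall(r"[A-Za-z0-9']+", sentence)
        for position, word in enumerate(words):
            total_words += 1
            if word[0].isdigit():
                numbers += 1  # years, distances, dates
            elif position > 0 and word[0].isupper() and word != "I":
                # Capitalized mid-sentence: treated as a proper noun (heuristic).
                proper_nouns += 1
    specifics = proper_nouns + numbers
    return {
        "proper_nouns": proper_nouns,
        "numbers": numbers,
        "outbound_links": outbound_links,
        "specific_density": specifics / total_words if total_words else 0.0,
    }


# Short excerpts from the two exhibits above.
exhibit_a = (
    "I heard lots of really good writers go to Big Sur. "
    "This makes sense to me, because the ocean is so peaceful and the "
    "mountains would give them peace to write. Plus the weather is warm."
)
exhibit_b = (
    'Miller, famed for his explicit novel "Tropic of Cancer," lived and '
    "worked in Big Sur between 1944 and 1962."
)
sig_a = quality_signature(exhibit_a)
sig_b = quality_signature(exhibit_b)
```

On these excerpts the Exhibit B sentence has a much higher density of specifics than the three Exhibit A sentences.  The heuristic is easy to fool, of course (it skips sentence-initial names like "Miller", for one), but the point stands:  it looks only at the characters on the page, not at who links to it.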

In closing, I'd like to make a few broadly philosophical comments about the terrain we've covered here with our discussion of intrinsic quality above.  If you've spent time reading about "Web revolutions" and movements and fads (they're usually "revolutions") from thinkers like Shirky or any of a number of Web futurists, you're always led down the road toward democratization of content, and the "wisdom of crowds" type of ideas, that tend naturally to undervalue or ignore individual expertise in favor of large collaborative projects, where content quality emerges out of the cumulative efforts of a group.  Whereas group-think is terrible in, say, entrepreneurial ventures (and at least in lip service is bad in large corporations), it's all the rage for the Web enthusiasts.  I mentioned before that an iconoclast like Lanier calls this the "hive mind" mentality, where lots of individually irrelevant (if not mindless) workers collectively can move mountains, creating Wikipedia, or developing open source software like Linux.  The Web ethos, in other words, doesn't seem too inviting for the philosophical themes introduced here:  a verifiable check on document quality (even if not perfect, it separates tripe like Exhibit A from something worthy of reading like Exhibit B), and along with it some conceptual tip of the hat to actual expertise.  It doesn't seem part of the Web culture, in other words, to insist that some blogs are made by experts on the topics they address, and many others are made by amateurs who have little insight, knowledge, or talent.  It's a kind of Web Elitism, in other words, and that seems very un-Web-like.

Only, it's not.  Like with the example of Yelp, where a reviewer has a kind of "circumstantial" expertise if they've actually gone to the cafe in the Mission District and sat and had an Espresso and a Croissant, there's expertise and authority stamped all over the Web.  In fact, if you think about it, often what makes the Web work is that we've imported the skills and talents and knowledge in the real world into the cyber realm.  That's why Yelp works.  And so the notion of "authority" and "expertise" we're dealing with here is relatively unproblematic.  No one gripes that they don't like their car mechanic to be an "expert" for instance;  rather, we're overjoyed when the person who fixes our ailing Volvo actually does have mechanical expertise--it saves us money, and helps assure a successful outcome.  Likewise we don't read fiction in the New Yorker because we think it's a crapshoot whether it's any better than what someone pulled off of the street outside our apartment could write.  Not that New Yorker fiction is somehow "better" in an objectionable, elitist way (or that the woman walking her dog out on the street couldn't be a fantastic short story writer), but only that the editors of the New Yorker should (we hope) have some taste for good fiction.  And the same goes for the editorial staff of the New York Times, or contributing writers to, say, Wired magazine.

We're accustomed to expecting quality in the real world, in other words, and so there's nothing particularly alarming about expecting or demanding it online, too.  For, from the fact that everyone can say anything about anything on the Web (which is the Web 2.0 motto, essentially), it simply doesn't follow that we all want to spend our day reading it.  For one, we can't, because there's simply too much content online these days.  But for two, and more importantly, we don't want to.  First, because life is short, and we'd rather read something that improves or enlightens or even properly amuses or entertains us.  And second, because, as the recent backlash against the Web culture from Carr, Lanier, and others suggests, it's making us stupid.  And, of course, life should be too short for that.






