Interesting People mailing list archives
Re Chronicle of Higher Education: Google and the Misinformed Public
From: "Dave Farber" <farber () gmail com>
Date: Wed, 18 Jan 2017 18:13:16 -0500
Begin forwarded message:
From: Chuck McManis <chuck.mcmanis () gmail com>
Date: January 18, 2017 at 5:34:59 PM EST
To: Dave Farber <dave () farber net>
Cc: ip <ip () listbox com>
Subject: Re: [IP] Re Chronicle of Higher Education: Google and the Misinformed Public

It sounds like there's a market opportunity here for a search engine that explicitly provides context for search results: credibility, fact checking, bias (not as a value judgement), research articles vs. journalism reporting on them, etc. Could also incorporate some form of crowd sourcing, etc.

Fortunately this experiment has been done at least once. The search engine Blekko was founded by Rich Skrenta and his friends from Topix. I joined as VP of operations (and later operations and engineering) in 2010, about six months before the product officially launched in November of 2010. The search engine was predicated on the idea that much of the content of the web was useless, and that the original search engine mission, finding pages you would not normally find, had flipped over to finding useful pages amid all the fluff. That concept was implemented as a curated index built from web sites that had been identified by humans (content editors) and generally validated by search traffic (click-throughs).

What had been observed was that Google had created an attractive nuisance in the early 2000s: a page with AdSense ads on it that landed on the first page of results for a given topic could deliver thousands of dollars a month in passive income, and the same page could provide additional income with affiliate links (sending referral cash to the referral network). Blekko defined such pages, which were simply placeholders to drive advertising traffic, as web "spam," in much the same vein as email spam. By 2015 Google had an estimated index size of 10 trillion documents; by analyzing over 200 billion pages, Blekko estimated that there were no more than 100 billion web documents with actual content.
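Blekko's actual curation pipeline isn't public, but the idea described above, admitting a page only when a human editor has approved its site and search traffic validates the choice, can be sketched roughly like this (the site list, threshold, and function names are invented for illustration):

```python
from urllib.parse import urlparse

# Hypothetical illustration of a curated index: a page qualifies only if
# its host was approved by a human editor AND its click-through rate on
# search results validates that approval. Values below are made up.
CURATED_SITES = {"example.edu", "example.org"}  # editor-approved hosts
MIN_CTR = 0.02                                  # assumed validation threshold

def in_curated_index(url, impressions, clicks):
    """Return True if the page belongs in the curated index."""
    host = urlparse(url).netloc
    if host not in CURATED_SITES:
        return False                            # never seen by an editor
    ctr = clicks / impressions if impressions else 0.0
    return ctr >= MIN_CTR                       # traffic confirms the curation

# usage
assert in_curated_index("https://example.edu/paper", 1000, 50)
assert not in_curated_index("https://adfarm.example.com/page", 1000, 500)
```

The point of the two-part test is the one Chuck makes: human judgment picks the candidate sites, and real user behavior (click-throughs) keeps the list honest.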
In that same year Doug Smith presented a paper he and I had authored at ICSC, and Doug and I had further refined the research to use training data from known-good web pages to begin automating the growth of the index with high-quality documents.

As a product idea it resonated strongly with anyone who used search as part of their job. Blekko was beloved by reference librarians around the world, and had a tremendous following with attorneys and journalists who used it for research, and with students trying to research term papers. As a business model it was less successful. Specifically, the only search 'intent' that makes money is commercial intent. Blekko was unable to pursue subscription access to the index, and because we did not index all content, the engine would do poorly on topics that were not curated, or "long tail" topics[1].

The company had a "3 card monte" game built into the interface: it would show results from Blekko, Bing, and Google (the only three US-based indexes available; today we're back to only two). The user was asked to pick the column with the "best" results, and the votes showed a consistent pattern across contested searches, long-tail searches, and topic searches. A "contested" search was one where a lot of people were attempting to game the algorithm (search for "no fee credit card" some time to see a good example of a contested search); on those Blekko consistently 'won' because our index wasn't influenced by these folks, and so we only returned good pages. On "long tail" searches Google generally won; it has a really, really big index. And on topic searches we would typically tie with Bing (and beat Google), or win outright if the topic was one of our curated topics.

In March of 2015 IBM bought the assets of Blekko and made it part of the Watson Group, where its crawler continues to go out and collect web pages, but now does so in service of building data sets for Watson rather than a search engine.
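The "3 card monte" game implies the columns were shuffled so voters couldn't be swayed by brand. The mechanics of that kind of blind comparison are simple to sketch; everything here (names, column labels, result format) is a stand-in, not Blekko's actual code:

```python
import random

# Hypothetical sketch of a blind "pick the best column" test in the
# spirit of Blekko's "3 card monte" game.
ENGINES = ["blekko", "bing", "google"]

def make_blind_round(results_by_engine):
    """Shuffle engines into anonymous columns A/B/C so the voter
    can't be biased by brand. Returns the columns to display and a
    server-side key mapping column label -> engine."""
    order = list(results_by_engine)
    random.shuffle(order)
    columns = {label: results_by_engine[eng] for label, eng in zip("ABC", order)}
    key = dict(zip("ABC", order))
    return columns, key

def record_vote(key, chosen_label, tally):
    """Credit the engine hidden behind the column the user picked."""
    engine = key[chosen_label]
    tally[engine] = tally.get(engine, 0) + 1

# usage: one simulated round
results = {e: [f"{e}-result-{i}" for i in range(3)] for e in ENGINES}
columns, key = make_blind_round(results)
tally = {}
record_vote(key, "A", tally)   # user judged column A best
assert sum(tally.values()) == 1
```

Aggregated over many rounds and bucketed by query type (contested, long-tail, topic), tallies like this are what would surface the pattern Chuck describes.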
What I learned was that it costs about $7.5M/year, grossed up, to operate a 5-billion-page index with enough bandwidth to serve on the order of 10M queries/day, and you can't make a profit with that unless you build up your own ad network (which Blekko never did). (At 10M queries/day, roughly 3.65B queries a year, $7.5M works out to about $0.002 per query.) People love quality search results, but they aren't actually willing to pay any money for them (or, conversely, they are willing to put up with the queries that return nothing but junk on Google for the sake of the ones that work well). Digital advertising networks are filled with people who make boiler-room sellers of sub-prime mortgages look like angels.

--Chuck

[1] A "long tail" topic is one for which only a handful of web pages exist and which is not referenced widely on the existing web.

On Wed, Jan 18, 2017 at 11:33 AM, Dave Farber <farber () gmail com> wrote:

Begin forwarded message:
From: Thomas Leavitt <thomleavitt () gmail com>
Date: January 18, 2017 at 2:26:30 PM EST
To: Dave Farber <dave () farber net>
Subject: Re: [IP] Chronicle of Higher Education: Google and the Misinformed Public

Dave,

It sounds like there's a market opportunity here for a search engine that explicitly provides context for search results: credibility, fact checking, bias (not as a value judgement), research articles vs. journalism reporting on them, etc. Could also incorporate some form of crowd sourcing, etc. Would be an interesting technical challenge to make this applicable across a broad range of searches, and of course there's the business case (or lack thereof) and going up against Google. On the other hand, it seems like there's a real need for genuine innovation in the space, and some obvious candidates that would likely be interested in executing a buyout of a successful implementation before the company goes to market.
Regards,
Thomas Leavitt

On Jan 17, 2017 10:13 AM, "Dave Farber" <farber () gmail com> wrote:

Begin forwarded message:
From: Lauren Weinstein <lauren () vortex com>
Date: January 17, 2017 at 11:20:06 AM EST
To: nnsquad () nnsquad org
Subject: [ NNSquad ] Chronicle of Higher Education: Google and the Misinformed Public

Chronicle of Higher Education: Google and the Misinformed Public
http://www.chronicle.com/article/Googlethe-Misinformed/238868

Digital media platforms like Google and Facebook may disavow responsibility for the results of their algorithms, but they can have tremendous -- and disturbing -- social effects. Racist and sexist bias, misinformation, and profiling are frequently unnoticed byproducts of those algorithms. And unlike public institutions (like the library), Google and Facebook have no transparent curation process by which the public can judge the credibility or legitimacy of the information they propagate. That misinformation can be debilitating for a democracy -- and in some instances deadly for its citizens.

- - -

--Lauren--
REPORT Fake News Here! - https://factsquad.com
CRUSHING the Internet Liars - https://vortex.com/crush-net-liars