Getting into Google News

Presentation by David Meredith at Newstools and Drupal Day

Held at Yahoo in Sunnyvale.

Each site is evaluated by a moderator.

What do we look for?
-Original content
-Multiple authors
-Proper attribution
-Response time

Not all sites make the cut, but most legitimate news sites do.

How do we get our data?

Once a site is in the database:
-we crawl the content
-figure out which pages are articles


Sitemaps are essentially a feed of semi-structured data for crawlers to

I cannot express to you what a boon sitemaps have been to us! It’s a basic
designed to be ingested by a crawler. A list of all articles you have
published, dates, names, etc. It’s an open spec and Google uses it. It saves
Google time in crawling your site.
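
To make the "list of all articles, dates, etc." idea concrete, here is a minimal sketch of building such a file with Python's standard library. It uses the urlset/url/loc/lastmod elements of the open sitemap protocol; the function name build_sitemap and the example URLs are made up for illustration, and this is not what the Drupal module or Google actually runs.

```python
import xml.etree.ElementTree as ET

# Namespace of the open sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(articles):
    """Build a sitemap XML string.

    articles: iterable of (url, last_modified) pairs, where last_modified
    is an ISO 8601 date string. Hypothetical helper, for illustration only.
    """
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, last_modified in articles:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Example article URLs are placeholders, not real pages.
    print(build_sitemap([
        ("https://example.com/news/story-one", "2008-01-15"),
        ("https://example.com/news/story-two", "2008-01-16"),
    ]))
```

Each article gets one url entry, so a crawler can read the whole list in a single request instead of discovering pages link by link.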

What does this have to do with Drupal?
–Drupal builds sitemaps.

There is a Google site module for Drupal.

That’s getting your data into Google News, now about making your data

II.  Now about search

Mentions “co-op” program for search

A. Syndicating from Google News
1. feeds are offered for every section or query that you can access as a
2. if you can construct a query to get the result you want, you can get a
feed of the data.
3. feeds are offered in RSS, Atom and XML
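
As a rough sketch of consuming one of these feeds, the snippet below pulls titles and links out of an RSS 2.0 document using only the standard library. The sample feed text is invented for the example and the function name parse_rss_items is hypothetical; in practice the text would come from whatever section or query feed you subscribed to.

```python
import xml.etree.ElementTree as ET

# Invented sample data standing in for a downloaded RSS 2.0 feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example section feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(rss_text):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items


if __name__ == "__main__":
    for title, link in parse_rss_items(SAMPLE_RSS):
        print(title, link)
```

An Atom feed uses different element names (entry instead of item, plus an XML namespace), so it would need its own small parser along the same lines.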

B. Google News Facebook applications
1. we built one entirely on top of public feeds
2. newsmap, uses static data

C. News Data API
1. more interactive API is in the works
2. many unanswered questions:

-what data do publishers need?
-what data do developers need?
-Can we give it to them?

What do you want to know about your own content (as a publisher and
developer)?

ClearForest API; Reuters is starting to use it.

New approaches: trying to approach synthesized news.

NewsKnight (?) scrapes the content

Google is more clustering based, not classification based.

Very fine grain taxonomies are expensive to maintain and don’t work well.

Google doesn’t publish their process because it comes down to spam.

Search for site: followed by your domain name to check if your site is being indexed by Google.
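
A small sketch of turning that check into a link you can open in a browser. It assumes the ordinary public search URL with a q parameter; the helper name site_query_url is made up for this example.

```python
from urllib.parse import urlencode


def site_query_url(domain):
    """Build a search URL for the query "site:<domain>".

    Assumption: the regular web search endpoint with a "q" parameter.
    """
    return "https://www.google.com/search?" + urlencode({"q": f"site:{domain}"})


if __name__ == "__main__":
    # example.com is a placeholder; substitute your own domain.
    print(site_query_url("example.com"))
```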

In the sitemap, label things that are news and things that are not news.

We like timely, good updates. A good update is when the article has a
substantial change to it. But what will appear to the system is a stream
of duplicates.

Press releases are not useful to users, but they may be useful to

Authenticated response is still there and is still going for Google News.

how would you categorize education essays?

How good are newspapers in general at search optimization?

SEO doesn’t make sense for news. It’s not something people spend a huge
amount of time on unless they are trying to subvert the index.

There are two tiers, NY Times, Washington Post.

Is Drupal adding micro-formats?

Google in general does, but Google News does not.

“We are not journalists, we don’t have a single journalist on staff, we just
build the technology to help the news.”

