Presentation by David Meredith at Newstools and Drupal Day
Held at Yahoo in Sunnyvale
Each site is evaluated by a moderator
What do we look for?
Not all sites make the cut, but most legitimate news sites do.
How do we get our data?
once in the site database:
we crawl the content
figure out which pages are articles
Use a SITEMAP!
sitemaps are essentially a feed of semi-structured data for crawlers to
I cannot express to you what a boon sitemaps have been to us! It’s a basic
designed to be ingested by a crawl. A list of all articles you have
published, dates, names, etc. It’s an open spec and Google uses it. Saves
Google time in crawling your site.
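To make the sitemap idea concrete, here is a minimal sketch of generating one in Python. It assumes the public sitemaps.org 0.9 schema (urlset, url, loc, lastmod). The function name build_sitemap and the example.com URL are made up for illustration, and this is not how Drupal or Google produce or read sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace from the public sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(articles):
    """Build a sitemap string from (url, last_modified_date) pairs.

    build_sitemap is a hypothetical helper for this sketch.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in articles:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url          # article address
        ET.SubElement(entry, "lastmod").text = lastmod  # date of last change
    return ET.tostring(urlset, encoding="unicode")


# Placeholder article list; a real site would pull this from its database.
sitemap_xml = build_sitemap([
    ("https://example.com/news/story-1", "2008-05-01"),
    ("https://example.com/news/story-2", "2008-05-02"),
])
print(sitemap_xml)
```

A crawler that understands the protocol can read this one file instead of discovering each article by following links.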
What does this have to do with Drupal?
–Drupal builds sitemaps.
There is a Google site module for Drupal
That’s getting your data into google news, now about making your data
II. Now about search
mentions the “co-op” program for search
A. Syndicating from Google News
1. feeds are offered for every section or query that you can access as a
2. if you can construct a query to get the result you want, you can get a
feed of the data.
3. feeds are offered in RSS, Atom and XML
B. Google News Facebook applications
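As a rough illustration of consuming such a feed, the sketch below pulls headlines and links out of an RSS 2.0 document using only the Python standard library. The sample feed is invented and embedded inline so that no real feed URL is assumed, and parse_rss_items is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented sample feed standing in for a section or query feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example query feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/a</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/b</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return a list of {'title', 'link'} dicts, one per RSS item."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


headlines = parse_rss_items(SAMPLE_FEED)
for entry in headlines:
    print(entry["title"], entry["link"])
```

In practice the XML text would come from fetching the feed for whatever section or query you constructed. The parsing step stays the same.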
1. we built one entirely on top of public feeds
2. newsmap, uses static data
C. News Data API
1. more interactive API is in the works
2. many unanswered questions:
-what data do publishers need?
-what data do developers need?
-Can we give it to them?
What do you want to know about your own content (as a publisher)?
Clear Forest API, Reuters is starting to use it.
New approaches: trying to approach synthesized news.
NewsKnight (?) scrapes the content
Google is more clustering based, not classification based.
Very fine grain taxonomies are expensive to maintain and don’t work well.
Google doesn’t publish their process because it comes down to spam.
Search for site: plus your domain name to check if your site is being indexed by Google.
In the sitemap, label things that are news and things that are not news.
We like timely, good updates. A good update is when the article has a
substantial change to it. But what it will appear to the system is a stream
Press releases are not useful to users, but they may be useful to
authenticated response is still there and is still going for google news
how would you categorize education essays?
How good are newspapers in general at search optimization?
SEO doesn’t make sense for news. It’s not something people spend a huge
amount of time on unless they are trying to subvert the index.
There are two tiers, NY Times, Washington Post.
Is Drupal adding micro-formats?
Google in general does, but Google News does not.
“We are not journalists, we don’t have a single journalist on staff, we just
build the technology to help the news.”