Internet Marketing Monitor
December 19, 2006
Filed Under (Site Design, Search Engines, SEO Tips, Google) by Matt / Derick on 12-19-2006

Have you ever done an online search and found that the results returned were all pretty much the same thing?  Maybe you've done a search for your content and discovered that both the regular version and the printer-friendly version are showing up in the results?  That's not good news for most website owners because printer-friendly versions of content don't usually include active links… so people can't use them to discover your other content.

Duplicate content can be a problem, especially when it's not intentional (we'll just assume that none of you would duplicate content to try to manipulate the search engine rankings).  Google addressed the issue yesterday afternoon and outlined how they deal with duplicate information.  And while all search engines operate differently, Google's approach should be viewed as fairly standard and indicative of similar policies you'd probably find at the other search engines.

Defining the Problem

Duplicate content, in the world of Google, is generally defined as content that is either identical verbatim or "appreciably similar".  Examples provided by Google include content generated for mobile use, store items available at two distinct URLs, or content that is automatically displayed in multiple formats (blog posts, printer-friendly versions, etc).  Versions of your content in different languages and "snippets" or quotes from other sites are not considered duplicate content.
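
To get a feel for what "appreciably similar" could mean in practice, here is a minimal Python sketch. Google has not published how it measures similarity, so this is not its method. It uses a common textbook technique (word shingles compared with Jaccard similarity), and the function names and sample text are made up for illustration.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat the whole thing as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap / size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similarity(text_a, text_b, k=3):
    """Score from 0.0 (nothing shared) to 1.0 (same shingle set)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))

# Hypothetical regular and printer-friendly versions of the same page.
regular = "Peanuts are legumes that grow underground in warm climates"
printer = "Printer friendly version: " + regular

print(similarity(regular, regular))
print(similarity(regular, printer))
```

The printer-friendly version scores well above an unrelated page but below a verbatim copy. Where to draw the "duplicate" line on that scale is a judgment call that this sketch leaves open.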

Google says that it addresses the issue for a couple of reasons.  For one, some website owners will intentionally duplicate content to try to manipulate the system.  I know none of you would try that.  But sadly, some people do.  In addition, Google wants to make sure that your best foot is being put forward in its SERPs.  You don't want the mobile version of your content showing up at a URL that's 6000 characters long when there's a standard version at a much shorter URL available.

Google's Approach

Google's crawling and indexing technology tries to determine what is distinct information and what isn't.  When someone searches for "peanut" and the search returns 50 pages with the exact same information about peanuts on them, Google is most likely going to show you only one of those.  If all 50 of those pages are yours, just in different formats, it's pretty much a hit-and-miss situation as to which version Google will choose to show.

If the search company thinks someone is trying to manipulate things, they will adjust the index and ranking of the guilty sites.  According to the post, Google tries to filter information before it dings a page's ranking, but filtering doesn't always work.  Manipulative people will continue to try to find ways around those filtering techniques.  Google has little choice but to intervene.  Now do you see why it's so important to get your duplicate content under control?  Do you really want the granddaddy of all search engines mistaking your website(s) for cheaters?

Your Job

Google provided a fairly detailed list of steps webmasters can take to make sure their content isn't incorrectly flagged as malicious or displayed in unpleasant ways on the SERP:

  1. Use robots.txt to tell Google (and other crawlers) which parts of your site to crawl and which to skip, such as printer-friendly versions.
  2. If you've moved a bunch of stuff around on your site, use 301 redirects ("RedirectPermanent" in an Apache .htaccess file) to let everyone - crawlers & visitors - know about the changes.
  3. Use the same linking approach everywhere.  If you're going to use /page, use that one everywhere.  Don't mix link styles (/page, /page/, page/something.html).
  4. Country-specific top-level domains are best for sites aimed at several countries.  Google is more likely to recognize .de and .jp as versions meant for Germany and Japan than it is a subdirectory or subdomain.
  5. Watch syndications for linkage back to your websites.  If you're sending your content out to other sites, and they're not linking back to show ownership, it looks like you're just trying to duplicate content.
  6. If you use Google Webmaster Tools, let the search crawler know if you prefer www.domain.com or domain.com.  That way, Google will index all of your pages the way you prefer, regardless of how links on other sites look.
  7. Avoid repeating the same information on every webpage.  Google's example is good:  copyright information on every page.  Don't put the full copyright text on every page.  Instead, put a link to a single page with your full copyright text.
  8. Don't use placeholder pages.  If something isn't available yet, don't publish the page until it's ready (whoops).  As tempting as it might be to stick an empty "Coming soon" page out there… wait.
  9. Master your content management system.  That is, know exactly how it publishes your content.
  10. Ignore content-scraping websites.  It's annoying, I know.  But Google says that it's highly unlikely that duplicate content on scraping sites will affect your indexing or ranking.  If it does, file a DMCA request.
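
A few of these steps (1, 3 and 6) are easy to check with a short script.  The Python sketch below is only an illustration, using the standard library's urllib.  The domain www.example.com, the /print/ directory and the function names are all assumptions invented for the example, not anything from Google's post.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Hypothetical site settings, chosen only for this example.
PREFERRED_HOST = "www.example.com"
HOST_ALIASES = {"example.com", "www.example.com"}

# Hypothetical robots.txt that keeps crawlers out of printer-friendly pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

def canonical_url(url):
    """Rewrite a link to one consistent style (steps 3 and 6).

    Uses the preferred host, drops any trailing slash except on the
    root path, and drops the fragment. Ports are ignored for simplicity.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in HOST_ALIASES:
        host = PREFERRED_HOST
    path = parts.path or "/"
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

def crawlable(url, robots_txt=ROBOTS_TXT, agent="Googlebot"):
    """Return True if the given robots.txt lets this agent fetch url (step 1)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

print(canonical_url("http://Example.com/page/"))
print(crawlable("http://www.example.com/print/article"))
print(crawlable("http://www.example.com/article"))
```

Running every internal link through something like canonical_url before publishing is one way to keep link styles from drifting, and crawlable gives a quick sanity check that a robots.txt rule blocks what you meant it to block and nothing more.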

The ball is in our court.  As webmasters, it's our job to make sure Google's crawler is getting the best of our content to index.  The 10 steps above are ways to make sure that happens.
