Duplicate content is defined as content that appears at two or more distinct URLs. By “duplicate” we mean blocks of content that are “substantially similar”, which can range from exact copies to paraphrased content.
Duplicate content can be internal: a page whose content is substantially identical to that of another page on your own site.
Duplicate content can also be external: content of yours that has been plagiarized on a web page you do not own.
External duplication is very penalizing because it is difficult to counter: when your content is plagiarized on a site with higher authority than yours, your original has a great chance of being ignored by Google, no matter who published the content first. Authority, not precedence, is what governs which content Google selects and displays.
What types of content are considered duplicate?
Plagiarized content from an identifiable source/URL: This type of duplicate content is the easiest to detect because it is often a simple word-for-word copy/paste from one page to another.
Paraphrased content, which has been slightly modified from the original: This is a little harder to detect. The content has been lightly rewritten, usually with “search and replace” functions applied to individual words or by spinning entire sentences. Google calls this “minimally modified copy” content; it detects it easily through n-gram analysis and decides on its own to label the pages as similar.
We have even observed in recent months that when Google detects too many external pages with similar content, in some cases it simply de-indexes them and forgets about them permanently.
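To make the n-gram idea mentioned above more concrete, here is a minimal sketch of how two texts can be compared by their shared word n-grams. It is only an illustration: the function names, the choice of three-word n-grams, and the Jaccard score are our own assumptions, and nothing here describes how Google actually measures similarity.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of word n-grams (shingles) found in a text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(text_a, text_b, n=3):
    """Share of n-grams common to both texts, from 0.0 (none) to 1.0 (all)."""
    a, b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    if not a or not b:
        # A text shorter than n words has no n-grams to compare.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical sample texts, invented for this illustration.
original = "Duplicate content is content that appears at more than one URL on the web"
spun = "Duplicate content is content that shows up at more than one URL on the web"
unrelated = "Our bakery sells fresh bread and pastries every morning of the week"

print(jaccard_similarity(original, original))   # identical copy
print(jaccard_similarity(original, spun))       # lightly reworded copy
print(jaccard_similarity(original, unrelated))  # unrelated text
```

A lightly "spun" text still shares most of its n-grams with the original, which is why replacing a few words is not enough to hide a copy from this kind of comparison.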
Why does Google care about duplicate content?
Google has a problem with duplicate content for three reasons:
It still has a hard time determining which page is the original (priority vs. authority).
Google limits the display of similar content to improve the quality of its search index.
Google has big gaps in its ability to identify and understand canonical markup (risk of loss of authorship or hijacking).
Every day, hundreds of millions of pages are plagiarized on the fly and sent back to the index. Even if Google detects and eliminates 99% of this kind of spam, perfectly legitimate sites very regularly fall victim to this kind of plagiarism. Without even knowing it, some websites are penalized because some of their pages are judged similar to the pages that plagiarized them!
There is no court to decide who plagiarized whom or who is responsible. Google's algorithm decides for everyone, and sometimes it decides wrongly, penalizing the poor legitimate site! Be aware that this victim site could very well be yours one day.
How to limit/stop the risks of plagiarized content?
It is difficult, if not impossible, to prevent malicious people from copying your content manually or automatically (with a scraper robot), but there are some solutions you can put in place to limit the risk of loss of authorship.
If you have similar content on multiple pages of your website, or want to indicate which page Google should prioritize, a canonical URL tells search engines which page to select and prefer.
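As an illustration, the canonical URL is declared in the head of a page with a link element whose rel attribute is "canonical". The sketch below uses only Python's standard library to read that declaration from a page's HTML, which can help when auditing your own pages. The class name, function name, and example.com address are hypothetical and chosen for this example only.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Only the first canonical link is kept; later ones are ignored.
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page, invented for this illustration.
page = """<html><head>
<title>Blue widgets</title>
<link rel="canonical" href="https://www.example.com/blue-widgets/">
</head><body>...</body></html>"""

print(find_canonical(page))
```

Running this check over every variant of a page (with tracking parameters, print versions, and so on) is a simple way to confirm they all point to the one URL you want search engines to prefer.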
A DMCA badge is a protective seal placed on your website that deters thieves from stealing your content. This service offered by https://www.dmca.com/ works quite well. But when the damage is done, sometimes it’s hard to go back.
Monitoring tools let you detect plagiarized and duplicated content. With this kind of tool you can monitor your content, track the risk of plagiarism, and detect possible duplicate content on third-party sites. Examples: Copyscape, DCChecker, Killduplicate, or PageVerify.
ScrapyLeaks Plugin:
Our plugin allows you to blur the lines and make your content unduplicable!
Whether it is by copy/paste or scraper robots, your content is protected!