By adding this information as metadata to an article, we create a graph of all of the world's written knowledge. Source: NYTimes / Visualisation using demos.explosion.ai

By adding this kind of contextual information as metadata to an article, we will create a graph of all of the world's written knowledge. This beautiful visualization made by Jer Thorp illustrates just one of the things that you can do if you have the data.

I was asked by the publishers of Popular Science magazine to produce a visualization piece that explored the archive of their publication. PopSci has a history that spans almost 140 years. Working with Mark Hansen, I ended up making a graphic that showed how different technical and cultural terms have come in and out of use in the magazine since its inception. (Jer Thorp)

The problem is that today this is a manual process: I can scrape one website, apply Natural Language Processing (NLP) and extract the data.

Afterwards, the next person who wants to do the same has to start all over again. If we give the task of adding this information to the people who write the article, the information becomes available to everybody who accesses the metadata of any published article in the world.

Part 1. History

Most people are familiar with HTML tags. Usually, HTML tags tell the browser how to display the information included in the tag. For example, <h1>Avatar</h1> tells the browser to display the text string "Avatar" in a heading 1 format. However, the HTML tag doesn't give any information about what that text string means: "Avatar" could refer to the hugely successful 3D movie, or it could refer to a type of profile picture. This can make it more difficult for search engines to intelligently display relevant content to a user.

Schema.org provides a collection of shared vocabularies webmasters can use to mark up their pages in ways that can be understood by the major search engines: Google, Microsoft, Yandex and Yahoo!

1a. Why use microdata?

Your web pages have an underlying meaning that people understand when they read them. But search engines have a limited understanding of what is being discussed on those pages. By adding additional tags to the HTML of your web pages (tags that say, “Hey search engine, this information describes this specific article, place, or person, or the original source of the article”), you can help search engines and other applications better understand your content and index and display it in a useful, relevant way. Microdata is a set of tags, introduced with HTML5, that allows you to do this.

What can we do if we have this data?

In 2009, Jer Thorp created this stunning visualization using the NYTimes API. It shows how the number of articles written by the NYTimes about the US presidents changed from 1984 until 2009.

With the Open Graph Publishing Standard, any publication can contribute this information, allowing data scientists to get a better understanding of the events and actions that make up our world.

1b. itemscope and itemtype

Let’s look at a concrete example. You have an article about Donald Trump: an article with a link to a source citation, information about the persons in the article, and so on. Your HTML code might look something like this:

<div>

<h1>Trump and the rise of populism in the World</h1>

<span>Author: Steve Justin</span>

<span>In the past few weeks [...]</span>

<a href="http://website.com/originalSource.html">Source</a>

</div>

To begin, identify the section of the page that is “about” the article. To do this, add the itemscope attribute to the HTML tag that encloses information about the item, like this:

<div itemscope>

<h1>Trump and the rise of populism in the World</h1>

<span>Author: Steve Justin</span>

<span>In the past few weeks [...]</span>

<a href="http://website.com/originalSource.html">Source</a>

</div>

By adding itemscope, you are specifying that the HTML contained in the <div>...</div> block is about a particular item.

But it’s not all that helpful to specify that there is an item being discussed without specifying what kind of item it is. You can specify the type of item using the itemtype attribute immediately after the itemscope attribute.


<div itemscope itemtype="https://schema.org/NewsArticle">

<h1>Trump and the rise of populism in the World</h1>

<span>Author: Steve Justin</span>

<span>In the past few weeks [...]</span>

</div>

This specifies that the item contained in the div is in fact a news article, as defined in the schema.org type hierarchy. Item types are provided as URLs, in this case https://schema.org/NewsArticle.
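To see how a consumer of this markup might read it, here is a minimal sketch in Python using only the standard library. The class and function names are hypothetical, and it only collects itemtype values from elements that carry itemscope; a real microdata parser would also follow itemprop attributes.

```python
from html.parser import HTMLParser


class ItemTypeParser(HTMLParser):
    """Collects the itemtype URL of every element that carries itemscope."""

    def __init__(self):
        super().__init__()
        self.item_types = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # itemscope is a boolean attribute, so html.parser reports its value as None.
        if "itemscope" in attr_map:
            self.item_types.append(attr_map.get("itemtype"))


def find_item_types(html):
    """Return the itemtype (or None) of each itemscope element, in document order."""
    parser = ItemTypeParser()
    parser.feed(html)
    return parser.item_types


page = """
<div itemscope itemtype="https://schema.org/NewsArticle">
<h1>Trump and the rise of populism in the World</h1>
<span>Author: Steve Justin</span>
</div>
"""

print(find_item_types(page))  # ['https://schema.org/NewsArticle']
```

An element with itemscope but no itemtype shows up as None in the list, which matches the point above: the parser knows there is an item, but not what kind.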

Part 2. Add metadata about what you post.

The HTML <head> Element

The <head> element is a container for metadata (data about data) and is placed between the <html> tag and the <body> tag.

HTML metadata is data about the HTML document. Metadata is not displayed.

Metadata typically define the document title, character set, styles, links, scripts, and other meta information.

The author <meta> Tag

HTML has no <author> element. The author of a page is declared with a <meta> tag placed inside the <head> element:

<meta name="author" content="Hege Refsnes">
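As a rough illustration of how this metadata can be read back, the sketch below collects every <meta name="..." content="..."> pair from a page with Python's standard html.parser module. The function name read_named_meta is an assumption made for this example.

```python
from html.parser import HTMLParser


class NamedMetaParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs into a dictionary."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        # Skip tags that use property= (Open Graph) or lack a content value.
        if name and content is not None:
            self.meta[name.lower()] = content


def read_named_meta(html):
    """Return a dict mapping meta names (lowercased) to their content."""
    parser = NamedMetaParser()
    parser.feed(html)
    return parser.meta


head = '<head><meta charset="utf-8"><meta name="author" content="Hege Refsnes"></head>'
print(read_named_meta(head)["author"])  # Hege Refsnes
```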

Part 3. The Open Graph Publishing Standard

The Open Graph protocol enables any web page to become a rich object in a social graph. For instance, this is used on Facebook to allow any web page to have the same functionality as any other object on Facebook.

75% of the top 100 online newspapers in the US are already using the Open Graph

I scraped the top 100 newspapers in the US to see how many of them already use the Open Graph.

For now, most of them provide only the bare minimum: the title, description and type of the article.

One big exception is the NYTimes, where every article contains information about the persons in that article, the location or geographical context of the article, the country or countries cited in the article, etc.
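A survey like this can be approximated in a few lines. The sketch below is not the script used for the numbers above; it assumes the pages have already been downloaded into a dictionary of HTML strings, and the sample pages and URLs are made up for illustration.

```python
import re

# Matches a <meta> tag whose property attribute starts with "og:".
OG_META = re.compile(r'<meta\s[^>]*property\s*=\s*["\']og:', re.IGNORECASE)


def uses_open_graph(html):
    """True if the page declares at least one Open Graph property."""
    return bool(OG_META.search(html))


def adoption_rate(pages):
    """Share of pages (0.0 to 1.0) that declare any Open Graph property."""
    if not pages:
        return 0.0
    return sum(uses_open_graph(html) for html in pages.values()) / len(pages)


# Hypothetical, already-downloaded front pages keyed by URL.
sample_pages = {
    "http://example.com/paper-one": '<head><meta property="og:title" content="One"></head>',
    "http://example.com/paper-two": "<head><title>Two</title></head>",
}

print(adoption_rate(sample_pages))  # 0.5
```

A regular expression is enough for a yes/no check like this one; extracting the actual values is better done with an HTML parser.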

To turn your web pages into graph objects, you need to add basic metadata to your page. You place additional <meta> tags in the <head> of your web page. The four required properties for every page are:

og:title - The title of your object as it should appear within the graph, e.g., "Fidel Castro Dies".

og:type - The type of your object, e.g., "article". Depending on the type you specify, other properties may also be required.

og:image - An image URL which should represent your object within the graph.

og:url - The canonical URL of your object that will be used as its permanent ID in the graph, e.g., "http://www.nytimes.com/2016/11/26/world/americas/fidel-castro-dies.html".

As an example, the following is the Open Graph protocol markup for the article "Fidel Castro Dies" on the New York Times website:

<html>

<head>

<title> Fidel Castro Dies </title>

<meta property="og:title" content="Fidel Castro, Cuban Revolutionary Who Defied U.S., Dies at 90" />

<meta property="og:type" content="article" />

<meta property="og:url" content="http://www.nytimes.com/2016/11/26/world/americas/fidel-castro-dies.html" />

<meta property="og:image" content=".. //images/world/Fidel-Castro-obituary.jpg" />

...

</head>

...

</html>
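Reading these tags back is straightforward. The following sketch, with hypothetical function names, extracts all og: properties from a page and reports which of the four required ones are missing. The sample page deliberately leaves out og:image to show the check at work.

```python
from html.parser import HTMLParser

REQUIRED_PROPERTIES = ("og:title", "og:type", "og:image", "og:url")


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property")
        content = attr_map.get("content")
        if prop and prop.startswith("og:") and content is not None:
            # Keep the first value seen and trim stray whitespace.
            self.properties.setdefault(prop, content.strip())


def read_open_graph(html):
    """Return a dict of the og: properties declared in the page."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


def missing_required(properties):
    """List the required Open Graph properties that the page does not declare."""
    return [name for name in REQUIRED_PROPERTIES if name not in properties]


page = """
<html>
<head>
<title>Fidel Castro Dies</title>
<meta property="og:title" content=" Fidel Castro, Cuban Revolutionary Who Defied U.S., Dies at 90 " />
<meta property="og:type" content="article" />
<meta property="og:url" content="http://www.nytimes.com/2016/11/26/world/americas/fidel-castro-dies.html" />
</head>
</html>
"""

properties = read_open_graph(page)
print(properties["og:type"])         # article
print(missing_required(properties))  # ['og:image']
```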

Other Tags

article - Namespace URI: http://ogp.me/ns/article#

article:published_time - datetime - When the article was first published.

article:modified_time - datetime - When the article was last changed.

article:author - profile array - Writers of the article.

article:section - string - A high-level section name. E.g. Technology

article:tag - string array - Tag words associated with this article.

copyright - string - Copyright type of the article.

photoSource - string array - The photo(s) source publication or name used with this article.

photoSourceURL - URL array - The URL of the photo(s) associated with this article.

language - string - The main language used in the article.
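Array properties such as article:tag are expressed by repeating the same <meta> tag, so a reader has to keep every value instead of only the last one. Here is a minimal sketch of that, with made-up sample values and hypothetical function names; it also parses the publication time, assuming it is written as an ISO 8601 datetime.

```python
from collections import defaultdict
from datetime import datetime
from html.parser import HTMLParser


class ArticleTagParser(HTMLParser):
    """Collects every article:* property, keeping repeated tags as a list."""

    def __init__(self):
        super().__init__()
        self.values = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("article:") and content is not None:
            self.values[prop].append(content.strip())


def read_article_tags(html):
    """Return a dict mapping each article:* property to the list of its values."""
    parser = ArticleTagParser()
    parser.feed(html)
    return dict(parser.values)


# Illustrative values only; they do not describe a real article.
page = """
<head>
<meta property="article:published_time" content="2020-01-15T09:30:00+00:00" />
<meta property="article:section" content="Technology" />
<meta property="article:tag" content="open data" />
<meta property="article:tag" content="metadata" />
</head>
"""

tags = read_article_tags(page)
published = datetime.fromisoformat(tags["article:published_time"][0])

print(tags["article:tag"])  # ['open data', 'metadata']
print(published.year)       # 2020
```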

The Distribution of ideas and articles in the online world.

An Open Graph Publishing Standard schema used by every website would allow a 10X increase in the contextual data available to investigative journalists and data scientists who use Natural Language Processing (NLP) to identify fake and misleading news articles online.

The current approach is to go to every single web platform, see what API it uses, create a parser for that website, and apply NLP to work out the contextual information in each article.

Discover the original source, publishing date, location and language of the article.

og:datePublished - The date when the article was published.

og:isBasedOn - A resource that was used in the creation of this resource. This term can be repeated for multiple sources.

og:locale - The locale these tags are marked up in. Of the format language_TERRITORY. Default is en_US.

og:contentLocation - The location depicted or described in the content.
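To make the language_TERRITORY format and the repeatable og:isBasedOn property concrete, here is a small sketch. It assumes the <meta> tags have already been read into (property, content) pairs; the function names and sample URLs are illustrative, not part of any specification.

```python
def parse_locale(value="en_US"):
    """Split a language_TERRITORY locale into (language, territory)."""
    language, _, territory = value.partition("_")
    return language.lower(), territory.upper()


def original_sources(meta_pairs):
    """Return every og:isBasedOn value, since the property may be repeated."""
    return [content for prop, content in meta_pairs if prop == "og:isBasedOn"]


# Hypothetical (property, content) pairs as they might be read from a page.
meta_pairs = [
    ("og:locale", "pt_BR"),
    ("og:isBasedOn", "http://example.com/first-source.html"),
    ("og:isBasedOn", "http://example.com/second-source.html"),
]

print(parse_locale("pt_BR"))         # ('pt', 'BR')
print(parse_locale())                # ('en', 'US')
print(len(original_sources(meta_pairs)))  # 2
```

With the sources available as plain URLs, tracing a story back to where it first appeared becomes a matter of following og:isBasedOn links from page to page.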

Discover more information about the article.

article:author - profile array - Writers of the article.

article:section - string - A high-level section name. E.g. Technology

article:tag - string array - Tag words associated with this article.

og:publisher

Discover what brands, institutions, laws, persons, products or services are referenced in an article.

og:institutionBrandsCited - The institutions or brands that are cited in the article. For example, White House.

og:institutionBrandsCitedWikidata - The Wikidata ID of the institutions or brands that are cited in the article.

og:PropositionLawCited - The proposition or law that is cited in this article. For example, Prop 51, CA.

og:PropositionLawCitedWikidata - The Wikidata ID of the proposition or law that is cited in this article.

og:personsCited - The persons that are cited in the article.

og:personsCitedLink - Personal or professional link of the person cited.

og:productsServicesCited - The products or services that are cited in the article.

og:productsServicesCitedWikidata - The Wikidata ID of the products or services that are cited in the article.
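Once articles expose the persons they cite, building the graph mentioned at the start of this post takes very little code. The sketch below assumes the og:personsCited values have already been collected per article URL; the URLs and names are placeholders, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_co_citation_graph(articles):
    """Link every pair of persons cited together in the same article.

    articles maps an article URL to the list of persons cited in it.
    Returns a dict mapping each person to the set of persons they appear with.
    """
    graph = defaultdict(set)
    for people in articles.values():
        # Deduplicate and sort so each pair is visited once per article.
        for first, second in combinations(sorted(set(people)), 2):
            graph[first].add(second)
            graph[second].add(first)
    return dict(graph)


# Placeholder data standing in for og:personsCited values.
articles = {
    "http://example.com/article-1": ["Person A", "Person B"],
    "http://example.com/article-2": ["Person A", "Person C"],
}

graph = build_co_citation_graph(articles)
print(sorted(graph["Person A"]))  # ['Person B', 'Person C']
```

The same approach works for institutions, laws or products, and using the Wikidata IDs as node keys instead of names would avoid merging two different people who share a name.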

Discover more about the Author.

og:article:author - Author name.

og:article:author:feed - An RSS feed of posts published by the author on this site.

og:article:author:link - Author's personal or professional page.

With every blog and organization that joins and starts using this standard, we will be able to get a better understanding of what is posted on the internet.

If you know journalists or bloggers, share this article with them. I would love to hear other opinions about this proposal.

This is a work in progress, join the conversation here

About Me

I'm part of the Organized Crime and Corruption Reporting Project (OCCRP), where I do data analysis and pattern recognition to uncover patterns of corruption in unstructured datasets.

You can find me online on Medium (Florin Badita), AngelList, Twitter, LinkedIn, OpenStreetMap, GitHub, Quora and Facebook.

Sometimes I write on my blog: http://florinbadita.com/