



Install HTML Agility Pack

PM> Install-Package HtmlAgilityPack





using HtmlAgilityPack;

HtmlWeb web = new HtmlWeb();

HtmlDocument doc = web.Load("http://www.codingdefined.com");

public HtmlDocument Load(string url, string method);
// method as GET, POST, PUT etc.

public HtmlDocument Load(string url, string method, WebProxy proxy, NetworkCredential credentials);
// method, plus the proxy and the credentials used for authentication

public HtmlDocument Load(string url, string proxyHost, int proxyPort, string userId, string password);
// proxy host, proxy port, and user id/password for authentication

// Select all the div's having class hentry
doc.DocumentNode.SelectNodes("//div[contains(@class,'hentry')]")

// Select all div's having id = main
doc.DocumentNode.SelectNodes("//div[@id='main']")

// Select all the a's inside class hentry
doc.DocumentNode.SelectNodes("//div[contains(@class,'hentry')]//a")

// Alternatively, query with LINQ; Contains is used because the divs carry
// more than one class (e.g. "hentry post"), so Equals would never match
var nodes = doc.DocumentNode.Descendants("div").Where(d => d.GetAttributeValue("class", "").Contains("hentry"));
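Either query style can be exercised without hitting the network by loading markup from a string with HtmlDocument.LoadHtml. A minimal sketch, using hypothetical sample markup that mirrors the blog's hentry structure (the URLs and titles are made up for illustration):

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class LinqQueryDemo
{
    static void Main()
    {
        // Hypothetical in-memory page standing in for the live site.
        string html = @"<html><body>
            <div class='hentry post'><h3><a href='/post-1'>First Post</a></h3></div>
            <div class='sidebar'><a href='/about'>About</a></div>
            <div class='hentry post'><h3><a href='/post-2'>Second Post</a></h3></div>
        </body></html>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // LINQ equivalent of the XPath query: every div whose class contains 'hentry'.
        var posts = doc.DocumentNode
                       .Descendants("div")
                       .Where(d => d.GetAttributeValue("class", "").Contains("hentry"))
                       .ToList();

        Console.WriteLine(posts.Count); // 2

        foreach (var post in posts)
        {
            var link = post.Descendants("a").First();
            Console.WriteLine(link.InnerText + " -> " + link.GetAttributeValue("href", ""));
        }
    }
}
```

Note that the sidebar div is skipped because its class attribute does not contain hentry.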

Example - Scraping CodingDefined.com home page to get post name and link





using System;

using System.Collections.Generic;

using System.Linq;

using System.Web;

using System.Web.UI;

using System.Web.UI.WebControls;

using HtmlAgilityPack;





public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.codingdefined.com");

        // SelectNodes returns null when no node matches, so guard before iterating
        HtmlNodeCollection posts = doc.DocumentNode.SelectNodes("//div[contains(@class,'hentry')]//h3//a");
        if (posts == null) return;

        foreach (HtmlNode node in posts)
        {
            HyperLink link = new HyperLink();
            link.Text = node.InnerText;                       // post title
            link.NavigateUrl = node.Attributes["href"].Value; // post link
            PlaceHolder1.Controls.Add(link);
            PlaceHolder1.Controls.Add(new LiteralControl("<br />"));
        }
    }
}
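The page above needs a live ASP.NET request and a network connection, but the extraction logic can be checked on its own. The sketch below runs the same XPath against a hypothetical static snapshot of the home-page markup (not fetched live), and shows the null guard that SelectNodes requires, since it returns null rather than an empty collection when nothing matches:

```csharp
using System;
using HtmlAgilityPack;

class ScrapeOffline
{
    static void Main()
    {
        // Hypothetical snapshot of the home-page markup so the example runs offline.
        string html = @"<div class='hentry post'>
                          <h3><a href='http://www.codingdefined.com/p1'>Post One</a></h3>
                        </div>
                        <div class='hentry post'>
                          <h3><a href='http://www.codingdefined.com/p2'>Post Two</a></h3>
                        </div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when no node matches.
        var anchors = doc.DocumentNode.SelectNodes("//div[contains(@class,'hentry')]//h3//a");
        if (anchors == null)
        {
            Console.WriteLine("No posts found.");
            return;
        }

        foreach (HtmlNode node in anchors)
        {
            Console.WriteLine(node.InnerText + " -> " + node.Attributes["href"].Value);
        }
    }
}
```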

In this post we will discuss how to do web scraping in ASP.NET using HtmlAgilityPack. Web scraping is a technique of extracting information from websites. As the volume of data on the web has increased, the practice has become increasingly widespread, and a number of powerful services have emerged to simplify it.

HTML Agility Pack is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT. In simple terms, it is a .NET library which allows you to parse files on the web. You can install HTML Agility Pack using the NuGet package manager by running the Install-Package command shown above in the Package Manager Console. After adding the reference via NuGet, you include it in your page with the using directive shown above.

First you create an instance of HtmlWeb, which is a utility class for fetching a document over HTTP. Then, using the Load function of HtmlWeb, you load the entire HTML document. If you need to, you can use any of the Load overloads listed above, for example to set the HTTP method or to go through a proxy with credentials.

The next step is to get a specific div or span by id or class; for that you select the nodes with SelectNodes as shown above. Alternatively, you can query the document with LINQ.

If you check the home page of Coding Defined, you will see that all the posts are inside divs having the class names hentry and post. So first we get all the nodes having class hentry, then the h3 tag inside each of those divs, and finally the a tag, to read the title and href. In the code above we read the title and href, save them in a HyperLink, and add it to the placeholder.

Please Like and Share the CodingDefined Blog if you find it interesting and helpful.