December 30, 2021

Web Scrapping in C# using Scraper API and HTMLAgilityPack

Overview:

In this Article, we will explore C# and how to create real life web scrapper using Scraper API and HtmlAgilityPack.

Requirement:

Recently, we have implemented a Scraper API to get data from different pages and dump into either CSV file OR Database for wholesale sourcing and ordering processing of the hospitality company based out Richmond, VA, USA.

Introduction:

What is Scraper API?

Scraper API is used to extract data. Its special purpose is to download large amounts of raw data easily and quickly. It is easy to use. We can scrape by sending the URL you would like to scrape to the API along with your API key and the API will return the HTML responses from the URL you want to scrape.

Get API Key from Scraper API?

We need to pass the API key with each Scraper API request to authenticate requests. For that, you need to sign up for an account here. After signing up on Scraper API, you will get 5000 free requests for a trial.

WEB SCRAPING USING C#

We are going to perform scraping with HTML parsing. We are going to extract data from https://coinmarketcap.com.
This website holds the information of cryptocurrencies, like name, current price, the percentage change in the last 24hrs, 7 days, market capital, etc.

Step-1: Create Project

First, create a project, here we are choosing Console App (.NET Core). You can choose the project template based on your requirement. Right now, our focus is web scraping so, skipped project creation steps.

Step-2: Install NuGet Packages

We require to install the following NuGet packages:
  1. ScraperAPI: This is the official C# SDK for the Scraper Api.
  2. HtmlAgilityPack: It is a .NET code library that allows you to parse "out of the web" HTML files.
Open Package Manager Console and run the below command one by one,
Install-Package ScraperApi
Install-Package HtmlAgilityPack

Step-3: GetDataFromWebPage() method

In this example, we are going to use the HttpClient and ScraperApiClient.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
static async Task GetDataFromWebPage() {
  string apiKey = "cb4b88b493a5d003efadb120698c1f14";
  HttpClient scraperApiHttpClient = ScraperApiClient.GetProxyHttpClient(apiKey);
  scraperApiHttpClient.BaseAddress = new Uri("https://coinmarketcap.com");

  var response = await scraperApiHttpClient.GetAsync("/");
  if (response.StatusCode == HttpStatusCode.OK) {
    var htmlData = await response.Content.ReadAsStringAsync();
    ParseHtml(htmlData);
  }
}

Replace your Scraper API key with “apiKey” variable.

GetProxyHttpClient() is used to create a HTTP client with the scraperapi.com.

GetAsync() will fetch the data from the website and store in local variable. proxy.

Step-4: ParseHtml() method

Once get html data from Webpage, parsing it using the HTMLdocument method. It comes from HtmlAgilityPack.

Next step, load html data and get the ‘tbody’ html tag from it. The tbody tag contains the rows of Cryptocurrency data.

To get more data, use selectSignleNode method. It will return the first HtmlNode that matches the XPath query, it will return a null reference if the matching node is not found. SelectNodes is a collection of Html Nodes.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
static void ParseHtml(string htmlData) {
  var coinData = new Dictionary < string,
    string > ();
  HtmlDocument htmlDoc = new HtmlDocument();
  htmlDoc.LoadHtml(htmlData);

  var theHTML = htmlDoc.DocumentNode.SelectSingleNode("html//body");
  var cmcTableBody = theHTML.SelectSingleNode("//tbody");
  var cmcTableRows = cmcTableBody.SelectNodes("tr");
  if (cmcTableRows != null) {
    foreach(HtmlNode row in cmcTableRows) {
      var cmcTableColumns = row.SelectNodes("td");
      string name = cmcTableColumns[2].InnerText;
      string price = cmcTableColumns[3].InnerText;
      coinData.Add(name, price);
    }
  }
  WriteDataToCSV(coinData);
}

Step-5: WriteDataToCSV() method

In this example, we have taken the currency name and its price from the scraped data, and store in CSV file.
1
2
3
4
5
6
7
8
9
static void WriteDataToCSV(Dictionary < string, string > cryptoCurrencyData) {
  var csvBuilder = new StringBuilder();

  csvBuilder.AppendLine("Name,Price");
  foreach(var item in cryptoCurrencyData) {
    csvBuilder.AppendLine(string.Format("{0},\"{1}\"", item.Key, item.Value));
  }
  File.WriteAllText("C:\\Kishan\\Webscraping.csv", csvBuilder.ToString());
}

Step-6: Main() method

Replace content of Main method with below code:
1
2
3
static async Task Main(string[] args) {
  await GetDataFromWebPage();
}

Output

We can see CSV file as output that contains two columns, 1) Currency Name and 2) Price. We can get required column and data based on our need.
Webscraping-csv










Conclusion

In this blog, with the help of ScraperAPI and HtmlAgility Nuget Packages, we can scrap the data from site, filter the require data and dump into CSV file.

If you have any questions you can reach out our SharePoint Consulting team here.

No comments:

Post a Comment