Background
I’ve recently been diving into the world of web scraping at work. “Web scraping” refers to extracting information from web pages. With jQuery, you can scrape targeted information from client-side scripts that integrate easily into existing web applications. jQuery is a free JavaScript library that lets developers search for (query) elements on a webpage using selector strings. The library also includes many other powerful features; see www.jQuery.com for the official documentation.
In order to ‘scrape’ a webpage, you first need to request the page itself. The jQuery load() function accomplishes this using an Ajax HTTP request, and can even extract specific page fragments based on a query. The jQuery website includes an important note that, due to the browser’s same-origin policy, Ajax requests are restricted to data on the same domain, sub-domain, and protocol. This is a deal-breaker for some applications, but it worked for me, as I needed to access data from the same sub-domain. At work, I wanted to do two things with jQuery: 1) obtain a variable ID for a .csv file in a storefront ordering system, and 2) load the file from the file’s URL, which is based on that unique ID number. I had to do this within a third-party platform hosted offsite, so finding a solution that could be integrated into a preexisting web page was vitally important.
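The same-origin restriction mentioned above can be sketched as a simple comparison of two URLs: the browser allows the Ajax request only when the protocol and host (including any sub-domain and port) match. This is just an illustrative helper, not part of jQuery; the example URLs are hypothetical.

```javascript
// A minimal sketch of the browser's same-origin check for Ajax requests:
// two URLs share an origin when their protocol and host (which includes
// sub-domain and port) are identical.
function sameOrigin(urlA, urlB) {
  var a = new URL(urlA);
  var b = new URL(urlB);
  return a.protocol === b.protocol && a.host === b.host;
}

// A request from the storefront page to a file on the same sub-domain
// is allowed; a different sub-domain or protocol is blocked.
sameOrigin('https://shop.example.com/orders',
           'https://shop.example.com/files/file123.csv'); // → true
sameOrigin('https://shop.example.com/orders',
           'https://other.example.com/data');             // → false
```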
Solution
I first used the jQuery load() function to access a storefront order page that contained the variable ID for the .csv file. Inside the URL parameter of the load() call, you can add a space character followed by any jQuery selector string in order to return a page fragment instead of the entire page. The entire page is still fetched, so you’re not saving time or bandwidth on the actual page request, but returning only the fragment you’re interested in saves time after the page is retrieved. Note that, according to the jQuery site, if you add a selector to the URL parameter of the load() call, scripts inside the loaded page will not execute. You would want the scripts to execute if, for example, you were loading a webpage into an HTML element and wanted it to display the entire page with its scripts. For my purposes, I appended the selector 'input[name="csvid"]', which loaded the page and then selected and returned only the input control whose name attribute equals "csvid":
```javascript
// Appending a selector to the URL makes load() insert only the matching fragment
var orderPageURL = 'relativeOrderPageURL input[name="csvid"]';
$('#results').load(orderPageURL, function() {
  // The input control named "csvid" from the order page has been loaded
  // into the div with an id of "results" on the current page.
  var csvId = $('#results input[name="csvid"]').val();
  // Gets the .csv identifier, which was stored in the "value" attribute
  // of the hidden input control.
});
```
Once I obtained the .csv identifier from the input control’s value attribute on the order page, I could make a jQuery get() call to load the .csv file itself. The jQuery get() function issues an Ajax HTTP GET request.
```javascript
var orderPageURL = 'relativeOrderPageURL input[name="csvid"]';
$('#results').load(orderPageURL, function() {
  // The input control named "csvid" from the order page has been loaded
  // into the div with an id of "results" on the current page.
  var csvId = $('#results input[name="csvid"]').val();
  // Gets the .csv identifier from the "value" attribute of the hidden input.
  var csvURL = 'fileDownloadURL/file' + csvId + '.csv'; // relative URL of the .csv file
  $.get(csvURL, function(csvFile) {
    // The file has been downloaded via HTTP GET into the variable csvFile.
    // Process the file as needed here.
  });
});
```
Upon successfully downloading the file using jQuery’s get() function, the callback function is executed. This is where I further processed the file as needed.
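As one sketch of that processing step, the downloaded .csv text can be split into rows and fields before further use. parseCsv is a hypothetical helper of my own naming, and this simple version assumes the data contains no quoted fields with embedded commas.

```javascript
// A minimal sketch of processing the csvFile text from the get() callback:
// split the file into lines, then split each line into fields.
// Assumes simple comma-separated data with no quoted fields.
function parseCsv(csvText) {
  return csvText
    .trim()
    .split(/\r?\n/)                  // one array entry per line
    .map(function(line) {
      return line.split(',');        // one array entry per field
    });
}

parseCsv('id,qty\n1,5\n2,3');
// → [['id', 'qty'], ['1', '5'], ['2', '3']]
```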
Conclusions
jQuery is a powerful tool for web scraping within a sub-domain in client-side applications. Using the load() and get() functions is a fairly straightforward way to download content from a matching sub-domain for use within the browser. Once the content is downloaded, selectors offer a powerful way to access specific information within it. All of this can be done from a single .html file, making it an ideal choice for developers working within the confines of a third-party platform.