Sunday, 30 November 2014

Web Scraping’s 2013 Review – part 1

Here we are, almost having ended another year and having the chance to analyze the aspects of the Web scraping market over the last twelve months. First of all i want to underline all the buzzwords on the tech field as published in the Yahoo’s year in review article . According to Yahoo, the most searched items wore

  •     iPhone (including 4, 5, 5s, 5c, and 6)
  •     Samsung (including Galaxy, S4, S3, Note)
  •     Siri
  •     iPad Cases
  •     Snapchat
  •     Google Glass
  •     Apple iPad
  •     BlackBerry Z10
  •     Cloud Computing

It’s easy to see that none of this terms regards in any way with the field of data mining, and they rather focus on the gadgets and apps industry, which is just one of the ways technology can evolve to. Regarding actual data mining industry there were a lot of talks about it in this year’s MIT’s Engaging Data 2013 Conference. One of the speakers Noam Chomsky gave an acid speech relating data extraction and its connection to the Big Data phenomena that is also on everyone’s lips this year. He defined a good way to see if Big Data works by following a series of few simple factors: 1. It’s the analysis, not the raw data, that counts. 2. A picture is worth a thousand words 3. Make a big data portal (Not sure if Facebook is planning on dominating in cloud services some day) 4. Use a hybrid organizational model (We’re asleep already, soon)  let’s move 5. Train employees Other interesting declaration  was given by EETimes saying, “Data science will do more for medicine in the next 10 years than biological science.” which says a lot about the volume of required extracted data.

Because we want to cover as many as possible events about data mining this article will be a two parter, so don’t forget to check our blog tomorrow when the second part of this article will come up!

Source:http://thewebminer.com/blog/2013/12/

Thursday, 27 November 2014

Scraping R-bloggers with Python – Part 2

In my previous post I showed how to write a small simple python script to download the pages of R-bloggers.com. If you followed that post and ran the script, you should have a folder on your hard drive with 2409 .html files labeled post1.html , post2.html and so forth. The next step is to write a small script that extract the information we want from each page, and store that information in a .csv file that is easily read by R. In this post I will show how to extract the post title, author name and date of a given post and store it in a .csv file with a unique id.

To do this open a document in your favorite python editor (I like to use aquamacs) and name it: extraction.py. As in the previous post we start by importing the modules that we will use for the extraction:

from BeautifulSoup import BeautifulSoup

import os
import re

As in the previous post we will be using the BeautifulSoup module to extract the relevant information from the pages. The os module is used to get a list of file from the directory where we have saved the .html files, and finally the re module allows us to use regular expressions to format the titles that include a comma value or a newline value (\n). We need to remove these as they would mess up the formatting of the .csv file.

After having read in the modules, we need to get a list of files that we can iterate over. First we need to specify the path were the files are saved, and then we use the os module to get all the filenames in the specified directory:

path = "/Users/thomasjensen/Documents/RBloggersScrape/download"

listing = os.listdir(path)

It might be that there are other files in the given directory, hence we apply a filter, in shape of a list comprehension, to weed out any file names that do not match our naming scheme:

listing = [name for name in listing if re.search(r"post\d+\.html",name) != None]

Notice that a regular expression was used to determine whether a given name in the list matched our naming scheme. For more on regular expressions have a look at this site.

The final steps in preparing our extraction is to change the working directory to where we have our .html files, and create an empty dictionary:

os.chdir(path)
data = {}

Dictionaries are one of the great features of Python. Essentially a dictionary is a mapping of a key to a specific value, however the fact that dictionaries can be nested within each other, allows us to create data structures similar to R’s data frames.

Now we are ready to begin extracting information from our downloaded pages. Much as in the previous post, we will loop over all the file names, read each file into Python and create a BeautifulSoup object from the file:

for page in listing:
    site = open(page,"rb")
    soup = BeautifulSoup(site)

In order to store the values we extract from a given page, we update the dictionary with a unique key for the page. Since our naming scheme made sure that each file had a unique name, we simply remove the .html part from the page name, and use that as our key:

key = re.sub(".html","",page)

data.update({key:{}})

This will create a mapping between our key and an empty dictionary, nested within the data dictionary. Once this is done we can start extract information and store it in our newly created nested dictionary. The content we want is located in the main column, which has the id tag “leftcontent” in the html code. To get at this we use find() function on soup object created above:

content = soup.find("div", id = "leftcontent")

The first “h1” tag in our content object contains the title, so again we will use the find() function on the content object, to find the first “h1” tag:

title = content.findNext("h1").text

To get the text within the “h1” tag the .text had been added to our search with in the content object.

To find the author name, we are lucky that there is a class of “div” tags called “meta” which contain a link with the author name in it. To get the author name we simply find the meta div class and search for a link. Then we pull out the text of the link tag:

author = content.find("div",{"class":"meta"}).findNext("a").text

Getting the date is a simple matter as it is nested within div tag with the class “date”:

date = content.find("div",{"class":"date"}).text

Once we have the three variables we put them in dictionaries that are nested within the nested dictionary we created with the key:

data[key]["title"] = title
data[key]["author"] = author
data[key]["date"] = date

Once we have run the loop and gone through all posts, we need to write them in the right format to a .csv file. To begin with we open a .csv file names output:

output = open("/Users/thomasjensen/Documents/RBloggersScrape/output.csv","wb")

then we create a header that contain the variable names and write it to the output.csv file as the first row:

variables = unicode(",".join(["id","date","author","title"]))
header = variables + "\n"
output.write(header.encode("utf8"))

Next we pull out all the unique keys from our dictionary that represent individual posts:

keys = data.keys()

Now it is a simple matter of looping through all the keys, pull out the information associated with each key, and write that information to the output.csv file:

for key in keys:
    print key
    id = key
    date = re.sub(",","",data[key]["date"])
    author = data[key]["author"]
    title = re.sub(",","",data[key]["title"])
    title = re.sub("\\n","",title)
    linelist = [id,date,author,title]
    linestring = unicode(",".join(linelist))
    linestring = linestring + "\n"
    output.write(linestring.encode("utf-8"))

Notice that we first create four variables that contain the id, date, author and title information. With regards to the title we use two regular expressions to remove any commas and “\n” from the title, as these would create new columns or new line breaks in the output.csv file. Finally we put the variables together in a list, and turn the list into a string with the list items separated by a comma. Then a linebreak is added to the end of the string, and the string is written to the output.csv file. As a last step we close the file connection:

output.close()

And that is it. If you followed the steps you should now have a csv file in your directory with 2409 rows, and four variables – ready to be read into R. Stay tuned for the next post which will show how we can use this data to see how R-bloggers has developed since 2005. The full extraction script is shown below:

from BeautifulSoup import BeautifulSoup

import os
import re

 path = "/Users/thomasjensen/Documents/RBloggersScrape/download"
 listing = os.listdir(path)

listing = [name for name in listing if re.search(r"post\d+\.html",name) != None]
 os.chdir(path)
 data = {}
 for page in listing:
site = open(page,"rb")
soup = BeautifulSoup(site)
key = re.sub(".html","",page)
print key
data.update({key:{}})
 content = soup.find("div", id = "leftcontent")
title = content.findNext("h1").text
author = content.find("div",{"class":"meta"}).findNext("a").text
date = content.find("div",{"class":"date"}).text
data[key]["title"] = title
data[key]["author"] = author
data[key]["date"] = date

 output = open("/Users/thomasjensen/Documents/RBloggersScrape/output.csv","wb")

 keys = data.keys()
 variables = unicode(",".join(["id","date","author","title"]))
 header = variables + "\n"
 output.write(header.encode("utf8"))
 for key in keys:
print key
id = key
date = re.sub(",","",data[key]["date"])
author = data[key]["author"]
title = re.sub(",","",data[key]["title"])
title = re.sub("\\n","",title)
linelist = [id,date,author,title]
linestring = unicode(",".join(linelist))
linestring = linestring + "\n"
output.write(linestring.encode("utf-8"))
 output.close()

Source:http://www.r-bloggers.com/scraping-r-bloggers-with-python-part-2/

Wednesday, 26 November 2014

Data Mining and Frequent Datasets

I've been doing some work for my exams in a few days and I'm going through some past papers but unfortunately there are no corresponding answers. I've answered the question and I was wondering if someone could tell me if I am correct.

My question is

    (c) A transactional dataset, T, is given below:
    t1: Milk, Chicken, Beer
    t2: Chicken, Cheese
    t3: Cheese, Boots
    t4: Cheese, Chicken, Beer,
    t5: Chicken, Beer, Clothes, Cheese, Milk
    t6: Clothes, Beer, Milk
    t7: Beer, Milk, Clothes

    Assume that minimum support is 0.5 (minsup = 0.5).

    (i) Find all frequent itemsets.

Here is how I worked it out:

    Item : Amount
    Milk : 4
    Chicken : 4
    Beer : 5
    Cheese : 4
    Boots : 1
    Clothes : 3

Now because the minsup is 0.5 you eliminate boots and clothes and make a combo of the remaining giving:

    {items} : Amount
    {Milk, Chicken} : 2
    {Milk, Beer} : 4
    {Milk, Cheese} : 1
    {Chicken, Beer} : 3
    {Chicken, Cheese} : 3
    {Beer, Cheese} : 2

Which leaves milk and beer as the only frequent item set then as it is the only one above the minsup?

data mining

Nanor

3 Answers

There are two ways to solve the problem:

    using Apriori algorithm
    Using FP counting

Assuming that you are using Apriori, the answer you got is correct.

The algorithm is simple:

First you count frequent 1-item sets and exclude the item-sets below minimum support.

Then count frequent 2-item sets by combining frequent items from previous iteration and exclude the item-sets below support threshold.

The algorithm can go on until no item-sets are greater than threshold.

In the problem given to you, you only get 1 set of 2 items greater than threshold so you can't move further.

There is a solved example of further steps on Wikipedia here.

You can refer "Data Mining Concepts and Techniques" by Han and Kamber for more examples.

141

There is more than two algorithms to solve this problem. I will just mention a few of them: Apriori, FPGrowth, Eclat, HMine, DCI, Relim, AIM, etc. –  Phil Mar 5 '13 at 7:18

OK to start, you must first understand, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cuts costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Now, the amount of raw data stored in corporate databases is exploding. From trillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. (One terabyte = one trillion bytes. A terabyte is equivalent to about 2 million books!) For instance, every day, Wal-Mart uploads 20 million point-of-sale transactions to an A&T massively parallel system with 483 processors running a centralized database.

Raw data by itself, however, does not provide much information. In today's fiercely competitive business environment, companies need to rapidly turn these terabytes of raw data into significant insights into their customers and markets to guide their marketing, investment, and management strategies.

Now you must understand that association rule mining is an important model in data mining. Its mining algorithms discover all item associations (or rules) in the data that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. Minsup controls the minimum number of data cases that a rule must cover. Minconf controls the predictive strength of the rule.

Since only one minsup is used for the whole database, the model implicitly assumes that all items in the data are of the same nature and/or have similar frequencies in the data. This is, however, seldom the case in real- life applications. In many applications, some items appear very frequently in the data, while others rarely appear. If minsup is set too high, those rules that involve rare items will not be found. To find rules that involve both frequent and rare items, minsup has to be set very low.

This may cause combinatorial explosion because those frequent items will be associated with one another in all possible ways. This dilemma is called the rare item problem. This paper proposes a novel technique to solve this problem. The technique allows the user to specify multiple minimum supports to reflect the natures of the items and their varied frequencies in the database. In rule mining, different rules may need to satisfy different minimum supports depending on what items are in the rules.

Given a set of transactions T (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf).

I hope that once you understand the very basics of data mining that the answer to this question shall become apparent.

1

The Apriori algorithm is based on the idea that for a pair o items to be frequent, each individual item should also be frequent. If the hamburguer-ketchup pair is frequent, the hamburger itself must also appear frequently in the baskets. The same can be said about the ketchup.

So for the algorithm, it is established a "threshold X" to define what is or it is not frequent. If an item appears more than X times, it is considered frequent.

The first step of the algorithm is to pass for each item in each basket, and calculate their frequency (count how many time it appears). This can be done with a hash of size N, where the position y of the hash, refers to the frequency of Y.

If item y has a frequency greater than X, it is said to be frequent.

In the second step of the algorithm, we iterate through the items again, computing the frequency of pairs in the baskets. The catch is that we compute only for items that are individually frequent. So if item y and item z are frequent on itselves, we then compute the frequency of the pair. This condition greatly reduces the pairs to compute, and the amount of memory taken.

Once this is calculated, the frequencies greater than the threshold are said frequent itemset.

Source: http://stackoverflow.com/questions/14164853/data-mining-and-frequent-datasets?rq=1

Wednesday, 19 November 2014

NHL ending dry scraping of ice before overtime

TORONTO (AP) — The NHL will no longer dry scrape the ice before overtime.

Instituted this season in an effort to reduce the number of shootouts, the dry scraping will stop after Friday's games.

The general managers decided at their meeting Tuesday to make the change after the league talked to the players' union the past few days.

Beginning Saturday, ice crews around the league will again shovel the ice after regulation as they did in previous years. The GMs said the dry scrape was causing too much of a delay. Director of hockey operations Colin Campbell said the delays were lasting from more than four minutes to almost seven.

The dry scrape initially had been approved in hopes of reducing shootouts by improving scoring chances without unduly slowing play by recoating the ice.

The GMs also discussed expanded video review, including goaltender interference, and the possibility of three-on-three overtime. The American Hockey League is experimenting with the three-on-three format this season.

This annual meeting the day after the Hockey Hall of Fame induction usually doesn't produce actual changes, with the dry scrape providing an exception.

The main purpose is to set up the March meeting in Boca Raton, Florida, where these items will be further addressed.

Source:http://missoulian.com/sports/hockey/nhl-ending-dry-scraping-of-ice-before-overtime/article_3dd5473c-6102-5800-99f7-2c98be0f99ad.html

Monday, 17 November 2014

Scraping websites using the Scraper extension for Chrome

If you are using Google Chrome there is a browser extension for scraping web pages. It’s called “Scraper” and it is easy to use. It will help you scrape a website’s content and upload the results to google docs.

Walkthrough: Scraping a website with the Scraper extension
  •     Open Google Chrome and click on Chrome Web Store
  •     Search for “Scraper” in extensions
  •     The first search result is the “Scraper” extension
  •     Click the add to chrome button.
  •     Now let’s go back to the listing of UK MPs
  •     Open http://www.parliament.uk/mps-lords-and-offices/mps/
  •     Now mark the entry for one MP
  •     http://farm9.staticflickr.com/8490/8264509932_6cc8802992_o_d.png
  •     Right click and select “scrape similar…”
  •     http://farm9.staticflickr.com/8200/8264509972_f3a9e5d8e8_o_d.png
  •     A new window will appear – the scraper console
  •     http://farm9.staticflickr.com/8073/8263440961_9b94e63d56_b_d.jpg
  •     In the scraper console you will see the scraped content
  •     Click on “Save to Google Docs…” to save the scraped content as a Google Spreadsheet.
Walkthrough: extended scraping with the Scraper extension

Note: Before beginning this recipe – you may find it useful to understand a bit about HTML. Read our HTML primer.

Easy wasn’t it? Now let’s do something a little more complicated. Let’s say we’re interested in the roles a specific actress played. The source for all kinds of data on this is the IMDB (You can also search on sites like DBpedia or Freebase for this kinds of information; however, we’ll stick to IMDB to show the principle)

    Let’s say we’re interested in creating a timeline with all the movies the Italian actress Asia Argento ever starred; where do we start?

    The IMDB has a quite comprehensive archive of actors. Asia Argento’s site is: http://www.imdb.com/name/nm0000782/

    If you open the page you’ll see all the roles she ever played, together with a title and the year – let’s scrape this information

    Try to scrape it like we did above

    You’ll see the list comes out garbled – this is because the list here is structured quite differently.

    Go to the scraper console. Notice the small box on the upper left, saying XPath?

    XPath is a query language for HTML and XML.

    XPath can help you find the elements in the page you’re interested in – all you need to do is find the right element and then write the xpath for it.

    Now let’s assemble our table.

    You’ll see that our current Xpath – the one including the whole information is “//div[3]/div[3]/div[2]/div”

    http://farm9.staticflickr.com/8344/8264510130_ae31697fde_o_d.png

    Xpath is very simple it tells the computer to look at the HTML document and select <div> element number 3, then in this the third one, the second one and then all <div> elements (which if you count down our list, results in exactly where you are right now.
  •     However, we’d like to have the data separated out.
  •     To do this use the columns part of the scraper console…
  •     Let’s find our title first – look at the title using Inspect Element
  •     http://farm9.staticflickr.com/8355/8263441157_b4672d01b2_o_d.png
  •     See how the title is within a <b> tag? Let’s add the tag to our xpath.
  •     The expression seems to work well: let’s make this our first column
  •     In the “Columns” section, change the name of the first column to “title”
  •     Now let’s add the XPATH for the title to it
  •     The xpaths in the columns section are relative, that means “./b” will select the <b> element
  •     add “./b” to the xpath for the title column and click “scrape”
  •     http://farm9.staticflickr.com/8357/8263441315_42d6a8745d_o_d.png
  •     See how you only get titles?
  •     Now let’s continue for year? Years are within one <span>
  •     Create a new column by clicking on the small plus next to your “title” column
  •     Now create the “year” column with xpath “./span”
  •     http://farm9.staticflickr.com/8347/8263441355_89f4315a78_o_d.png
  •     Click on scrape and see how the year is added
  •     See how easily we got information out of a less structured webpage?
Source: http://schoolofdata.org/handbook/recipes/scraper-extension-for-chrome/

Saturday, 15 November 2014

A Web Scraper’s Guide to Kimono

Being a frequent reader of Hacker News, I noticed an item on the front page earlier this year which read, “Kimono – Never write a web scraper again.” Although it got a great number of upvotes, the tech junta was quick to note issues, especially if you are a developer who knows how to write scrapers. The biggest concern was a non-intuitive UX, followed by the inability of the first beta version to extract data items from websites as smoothly as the demo video suggested.

I decided to give it a few months before I tested it out, and I finally got the chance to do so recently.

Kimono is a Y-Combinator backed startup trying to do something in a field where others have failed. Kimono is focused on creating APIs for websites which don’t have one, another term would be web scraping. Imagine you have a website which shows some data you would like to dynamically process in your website or application. If the website doesn’t have an API, you can create one using Kimono by extracting the data items from the website.

Is it Legal?

Kimono provides an FAQ section, which says that web scraping from public websites “is 100% legal” as long as you check the robots.txt file to see which URL patterns they have disallowed. However, I would advise you to proceed with caution because some websites can pose a problem.

A robots.txt is a file that gives directions to crawlers (usually of search engines) visiting the website. If a webmaster wants a page to be available on search engines like Google, he would not disallow robots in the robots.txt file. If they’d prefer no one scrapes their content, they’d specifically mention it in their Terms of Service. You should always look at the terms before creating an API through Kimono.

An example of this is Medium. Their robots.txt file doesn’t mention anything about their public posts, but the following quote from their TOS page shows you shouldn’t scrape them (since it involves extracting data from their HTML/CSS).

    For the remainder of the site, you may not duplicate, copy, or reuse any portion of the HTML/CSS, JavaScipt, logos, or visual design elements without express written permission from Medium unless otherwise permitted by law.

If you check the #BuiltWithKimono section of their website, you’d notice a few straightforward applications. For instance, there is a price comparison API, which is built by extracting the prices from product pages on different websites.

Let us move on and see how we can use this service.

What are we about to do?

Let’s try to accomplish a task, while exploring Kimono. The Blog Bowl is a blog directory where you can share and discover blogs. The posts that have been shared by users are available on the feeds page. Let us try to get a list of blog posts from the page.

The simple thought process when scraping the data is parsing the HTML (or searching through it, in simpler terms) and extracting the information we require. In this case, let’s try to get the title of the post, its link, and the blogger’s name and profile page.

Source: http://www.sitepoint.com/web-scrapers-guide-kimono/

Thursday, 13 November 2014

The PromptCloud Advantage- Web Scraping with an Edge

The global market is now more aware of its data scraping needs. And so with the demand, the list of suppliers has grown too. This post is dedicated to bringing out the PromptCloud Advantage among such providers.

PromptCloud-Winning-The Race

1. The know-how- Crawling the web, as mundane as it may sound, is a fairly complex task. No one is to be blamed for overlooking the complexity as these things surface only after you’ve tried it yourself and delved into the nitty-gritty. The design decisions you take sit at the core of what you build and eventually monetize. And the long-term effects of such architectural choices are as pleasing if you’ve done it right as disturbing they might turn out if you’re not far-sighted.

Although the expertise of building the tech stack for such large-scale data acquisition, distributing your clusters (and putting thoughts into their geographical locations), maintaining queues, databases and backups, does come from ‘been there done that’, we have been lucky to have the tech advantage imbibed into us since inception. Not that we got it right the first time, but our systems have evolved with technologies, improving each day. Now that we have been there in this business for the last 56 months, it does feel like a long journey for our stack and yes, we do know better :)

2. SLAs- SLAs are what bolsters the data itself. PromptCloud’s key SLAs are scale and quality; while not compromising the data coverage or the politeness policies on your sources. Since we perform focused crawls, there’s no dilution of data and you can consume it all or ask us to index it in order to search using logical combinations in queries. For your reference, here’s a list of all SLAs to visit while picking your data service provider.

changing_place_changing_time_changing_thouts_changing_future.

3. The Experience- There are many scraping tools and crawling services in the market which might just serve the need. What PromptCloud provides is a data acquisition experience; and we go as many number of extra miles as you’d like us to go for it. By leveraging our DaaS platform, we make sure you get what you need from the time you start your research for a data provider through importing the data feeds into your database. We hear your requirements in detail, make sure we’ve got it right by sharing samples and going multiple iterations of reprocessing the data to match your needs while you battle internally on freezing your requirements. But what’s more magical is the way all these feeds get delivered to you, at the intervals you requested; programatically.

It might be evident for the SLAs and the know-how fusing to provide the experience, but it’s that additional human touch that actually aids in sustaining it. We make sure you’re at peace while our systems handle the roadblocks and sort out the messiness on the web.

Source:https://www.promptcloud.com/blog/the-promptcloud-advantage-web-scraping/

Wednesday, 12 November 2014

A Content Marketer's Guide to Data Scraping

As digital marketers, big data should be what we use to inform a lot of the decisions we make. Using intelligence to understand what works within your industry is absolutely crucial within content campaigns, but it blows my mind to know that so many businesses aren't focusing on it.

One reason I often hear from businesses is that they don't have the budget to invest in complex and expensive tools that can feed in reams of data to them. That said, you don't always need to invest in expensive tools to gather valuable intelligence — this is where data scraping comes in.

Just so you understand, here's a very brief overview of what data scraping is from Wikipedia:

    "Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program."

Essentially, it involves crawling through a web page and gathering nuggets of information that you can use for your analysis. For example, you could search through a site like Search Engine Land and scrape the author names of each of the posts that have been published, and then you could correlate this to social share data to find who the top performing authors are on that website.

Hopefully, you can start to see how this data can be valuable. What's more, it doesn't require any coding knowledge — if you're able to follow my simple instructions, you can start gathering information that will inform your content campaigns. I've recently used this research to help me get a post published on the front page of BuzzFeed, getting viewed over 100,000 times and channeling a huge amount of traffic through to my blog.

Disclaimer: One thing that I really need to stress before you read on is the fact that scraping a website may breach its terms of service. You should ensure that this isn't the case before carrying out any scraping activities. For example, Twitter completely prohibits the scraping of information on their site. This is from their Terms of Service:

    "crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited"

Google similarly forbids the scraping of content from their web properties:

    Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google.

So be careful, kids.

Content analysis

Mastering the basics of data scraping will open up a whole new world of possibilities for content analysis. I'd advise any content marketer (or at least a member of their team) to get clued up on this.

Before I get started on the specific examples, you'll need to ensure that you have Microsoft Excel on your computer (everyone should have Excel!) and also the SEO Tools plugin for Excel (free download here). I put together a full tutorial on using the SEO tools plugin that you may also be interested in.

Alongside this, you'll want a web crawling tool like Screaming Frog's SEO Spider or Xenu Link Sleuth (both have free options). Once you've got these set up, you'll be able to do everything that I outline below.

So here are some ways in which you can use scraping to analyse content and how this can be applied into your content marketing campaigns:

1. Finding the different authors of a blog

Analysing big publications and blogs to find who the influential authors are can give you some really valuable data. Once you have a list of all the authors on a blog, you can find out which of those have created content that has performed well on social media, had a lot of engagement within the comments and also gather extra stats around their social following, etc.

I use this information on a daily basis to build relationships with influential writers and get my content placed on top tier websites. Here's how you can do it:

Step 1: Gather a list of the URLs from the domain you're analysing using Screaming Frog's SEO Spider. Simply add the root domain into Screaming Frog's interface and hit start (if you haven't used this tool before, you can check out my tutorial here).

Once the tool has finished gathering all the URLs (this can take a little while for big websites), simply export them all to an Excel spreadsheet.

Step 2: Open up Google Chrome and navigate to one of the article pages of the domain you're analysing and find where they mention the author's name (this is usually within an author bio section or underneath the post title). Once you've found this, right-click their name and select inspect element (this will bring up the Chrome developer console).

Within the developer console, the line of code associated to the author's name that you selected will be highlighted (see the below image). All you need to do now is right-click on the highlighted line of code and press Copy XPath.

For the Search Engine Land website, the following code would be copied:

//*[@id="leftCol"]/div[2]/p/span/a

This may not make any sense to you at this stage, but bear with me and you'll see how it works.

Step 3: Go back to your spreadsheet of URLs and get rid of all the extra information that Screaming Frog gives you, leaving just the list of raw URLs – add these to the first column (column A) of your worksheet.

Step 4: In cell B2, add the following formula:

=XPathOnUrl(A2,"//*[@id='leftCol']/div[2]/p/span/a")

Just to break this formula down for you, the function XPathOnUrl allows you to use the XPath code directly within (this is with the SEO Tools plugin installed; it won't work without this). The first element of the function specifies which URL we are going to scrape. In this instance I've selected cell A2, which contains a URL from the crawl I did within Screaming Frog (alternatively, you could just type the URL, making sure that you wrap it within quotation marks).

Finally, the last part of the function is our XPath code that we gathered. One thing to note is that you have to remove the quotation marks from the code and replace them with apostrophes. In this example, I'm referring to the "leftCol" section, which I've changed to ‘leftCol' — if you don't do this, Excel won't read the formula correctly.

Once you press enter, there may be a couple of seconds delay whilst the SEO Tools plugin crawls the page, then it will return a result. It's worth mentioning that within the example I've given above, we're looking for author names on article pages, so if I try to run this on a URL that isn't an article (e.g. the homepage) I will get an error.

For those interested, the XPath code itself works by starting at the top of the code of the URL specified and following the instructions outlined to find on-page elements and return results. So, for the following code:

//*[@id='leftCol']/div[2]/p/span/a

We're telling it to look for any element (//*) that has an id of leftCol (@id='leftCol') and then go down to the second div tag after this (div[2]), followed by a p tag, a span tag and finally, an a tag (/p/span/a). The result returned should be the text within this a tag.

Don't worry if you don't understand this, but if you do, it will help you to create your own XPath. For example, if you wanted to grab the output of an a tag that has rel=author attached to it (another great way of finding page authors), then you could use some XPath that looked a little something like this:

//a[@rel='author']

As a full formula within Excel it would look something like this:

=XPathOnUrl(A2,"//a[@rel='author']")

Once you've created the formula, you can drag it down and apply it to a large number of URLs all at once. This is a huge time-saver as you'd have to manually go through each website and copy/paste each author to get the same results without scraping – I don't need to explain how long this would take.

Now that I've explained the basics, I'll show you some other ways in which scraping can be used…

2. Finding extra details around page authors

So, we've found a list of author names, which is great, but to really get some more insight into the authors we will need more data. Again, this can often be scraped from the website you're analysing.

Most blogs/publications that list the names of the article author will actually have individual author pages. Again, using Search Engine Land as an example, if you click my name at the top of this post you will be taken to a page that has more details on me, including my Twitter profile, Google+ profile and LinkedIn profile. This is the kind of data that I'd want to gather because it gives me a point of contact for the author I'm looking to get in touch with.

Here's how you can do it.

Step 1: First we need to get the author profile URLs so that we can scrape the extra details off of them. To do this, you can use the same approach to find the author's name, with just a little addition to the formula:

=XPathOnUrl(A2,"//a[@rel='author']", <strong>"href"</strong>)

The addition of the "href" part of the formula will extract the output of the href attribute of the atag. In Lehman terms, it will find the hyperlink attached to the author name and return that URL as a result.

Step 2: Now that we have the author profile page URLs, you can go on and gather the social media profiles. Instead of scraping the article URLs, we'll be using the profile URLs.

So, like last time, we need to find the XPath code to gather the Twitter, Google+ and LinkedIn links. To do this, open up Google Chrome and navigate to one of the author profile pages, right-click on the Twitter link and select Inspect Element.

Once you've done this, hover over the highlighted line of code within Chrome's developer tools, right-click and select Copy XPath.

Step 3: Finally, open up your Excel spreadsheet and add in the following formula (using the XPath that you've copied over):

=XPathOnUrl(C2,"//*[@id='leftCol']/div[2]/p/a[2]", "href")

Remember that this is the code for scraping Search Engine Land, so if you're doing this on a different website, it will almost certainly be different. One important thing to highlight here is that I've selected cell C2 here, which contains the URL of the author profile page and not just the article page. As well as this, you'll notice that I've included "href" at the end because we want the actual Twitter profile URL and not just the words ‘Twitter'.

You can now repeat this same process to get the Google+ and LinkedIn profile URLs and add it to your spreadsheet. Hopefully you're starting to see the value in this, and how it can be used to gather a lot of intelligence that can be used for all kinds of online activity, not least your SEO and social media campaigns.

3. Gathering the follower counts across social networks

Now that we have the author's social media accounts, it makes sense to get their follower counts so that they can be ranked based on influence within the spreadsheet.

Here are the final XPath formulae that you can plug straight into Excel for each network to get their follower counts. All you'll need to do is replace the text INSERT SOCIAL PROFILE URL with the cell reference to the Google+/LinkedIn URL:

Google+:

=XPathOnUrl(<strong>INSERTGOOGLEPROFILEURL</strong>,"//span[@class='BOfSxb']")

LinkedIn:

=XPathOnUrl(<strong>INSERTLINKEDINURL</strong>,"//dd[@class='overview-connections']/p/strong")

4. Scraping page titles

Once you've got a list of URLs, you're going to want to get an idea of what the content is actually about. Using this quick bit of XPath against any URL will display the title of the page:

=XPathOnUrl(A2,"//title")

To be fair, if you're using the SEO Tools plugin for Excel then you can just use the built-in feature to scrape page titles, but it's always handy to know how to do it manually!

A nice extra touch for analysis is to look at the number of words used within the page titles. To do this, use the following formula:

=CountWords(A2)

From this you can get an understanding of what the optimum title length of a post within a website is. This is really handy if you're pitching an article to a specific publication. If you make the post the best possible fit for the site and back up your decisions with historical data, you stand a much better chance of success.

Taking this a step further, you can gather the social shares for each URL using the following functions:

Twitter:

=TwitterCount(<strong>INSERTURLHERE</strong>)

Facebook:

=FacebookLikes(<strong>INSERTURLHERE</strong>)

Google+:

=GooglePlusCount(<strong>INSERTURLHERE</strong>)

Note: You can also use a tool like URL Profiler to pull in this data, which is much better for large data sets. The tool also helps you to gather large chunks of data from other social networks, link data sources like Ahrefs, Majestic SEO and Moz, which is awesome.

If you want to get even more social stats then you can use the SharedCount API, and this is how you go about doing it…

Firstly, create a new column in your Excel spreadsheet and add the following formula (where A2 is the URL of the webpage you want to gather social stats for):

=CONCATENATE("http://api.sharedcount.com/?url=",A2)

You should now have a cell that contains your webpage URL prefixed with the SharedCount API URL. This is what we will use to gather social stats. Now here's the Excel formula to use for each network (where B2 is the cell that contaiins the formula above):

StumbleUpon:

=JsonPathOnUrl(B2,"StumbleUpon")

Reddit:

=JsonPathOnUrl(B2,"Reddit")

Delicious:

=JsonPathOnUrl(B2,"Delicious")

Digg:

=JsonPathOnUrl(B2,"Diggs")

Pinterest:

=JsonPathOnUrl(B2,"Pinterest")

LinkedIn:

=JsonPathOnUrl(B2,"Linkedin")

Facebook Shares:

=JsonPathOnUrl(B2,"Facebook.share_count")

Facebook Comments:

=JsonPathOnUrl(B2,"Facebook.comment_count")

Once you have this data, you can start looking much deeper into the elements of a successful post. Here's an example of a chart that I created around a large sample of articles that I analysed within Upworthy.com.

The chart looks at the average number of social shares that an article on Upworthy receives vs the number of words within its title. This is invaluable data that can be used across a whole host of different on-page elements to get the perfect article template for the site you're pitching to.

See, big data is useful!

5. Date/time the post was published

Along with analysing the details of headlines that are working within a site, you may want to look at the optimal posting times for best results. This is something that I regularly do within my blogs to ensure that I'm getting the best possible return from the time I spend writing.

Every site is different, which makes it very difficult for an automated, one-size-fits-all tool to gather this information. Some sites will have this data within the <head> section of their webpages, but others will display it directly under the article headline. Again, Search Engine Land is a perfect example of a website doing this…

So here's how you can scrape this information from the articles on Search Engine Land:

=XPathOnUrl(<strong>INSERTARTICLEURL</strong>,"//*[@class='dateline']/text()")

Now you've got the date and time of the post. You may want to trim this down and reformat it for your data analysis, but you've got it all in Excel so that should be pretty easy.

Extra reading

Data scraping is seriously powerful, and once you've had a bit of a play around with it you'll also realise that it's not that complicated. The examples that I've given are just a starting point but once you get your creative head on, you'll soon start to see the opportunities that arise from this intelligence.

Here's some extra reading that you might find useful:

    http://findmyblogway.com/scraping-communities-with-xpath/

    http://builtvisible.com/data-entry-is-a-waste-of-time/

    http://www.seotakeaways.com/data-scraping-guide-for-seo/

    http://okdork.com/2014/04/30/the-step-by-step-guide-to-10x-growth-for-any-blog/

TL;DR

    Start using actual data to inform your content campaigns instead of going on your gut feeling.

    Gather intelligence around specific domains you want to target for content placement and create the perfect post for their audience.

    Get clued up on XPath and JSON through using the SEO Tools plugin for Excel.

    Spend more time analysing what content will get you results as opposed to what sites will give you links!

    Check the website's ToS before scraping.

Source:http://moz.com/blog/a-content-marketers-guide-to-data-scraping

Monday, 10 November 2014

My Experience in Choosing a Web Scraping Service

Recently I decided to outsource a web scraping project to another company. I typed “web scraping service” in Google, chose six services from the first two search result pages and sent the project specifications to all of them to get quotes. Eventually I decided to go another way and did not order the services, but my experience may be useful for others who want to entrust web scraping jobs to third party services.

If you are interested in price comparisons only and not ready to read the whole story just scroll down.

A list of web scraping services I sent my project to:

    www.datahen.com - Canadian web scraping service with nice web design
    webdata-scraping.com - Indian service by Keval Kothari
    www.iwebscraping.com - India based web scraping company (same as www.3idatascraping.com)
    scrapinghub.com - A scraping service founded by creators of Scrapy
    web-scraper.com - Yet another web scraping service
    grepsr.com - A scraping service that we already reviewed two years ago


Sending the request

All the services except scrapinghub.com have quite simple forms for the description of the project requirements. Basically, you just need to give your contact details and a project description in any form. Some of them are pretty (like datahen.com), some of them are more ascetic (like web-scraper.com), but all of them allow you to send your requirements to developers.

Scrapinghub.com has a quite long form, but most of the fields are optional and all the questions are quite natural. If you really know what you need, then it won’t be hard to answer all of them; moreover they rather help you to describe your need in detail.

Note, that in the context of the project I didn’t make a request for a scraper itself. I asked to receive data on a weekly basis only.

Getting responses

Since I sent my request on Sunday it would have been ok not to receive responses the same day, but I got the first response in 3 hrs! It was from web-scraper.com and stated that this project will cost me $250 monthly. Simple and clear. Thank you, Thang!

Right after that, I received the second response. This time it was Keval from webdata-scraping.com. He had some questions regarding the project. Then after two days he wrote me that it would be hard to scrape some of my data with the software he uses, and that he will try to use a custom scraper. After that he disappeared… ((

Then on Monday I received Cost & ETAT details from datahen.com. It looked quite professional and contained not only price, but also time estimation. They were ready to create such a scraper in 3-4 days for $249 and then maintain it for just $65/month.

On the same day I received a quote from iwebscraping.com. It was $60 per week. Everything is fine, but I’d like to mention that it wasn’t the last letter from them. After I replied to them (right after receiving the quote), I received a reminder letter from them every other day for about a week. So be ready for aggressive marketing if you ask them for a quote )).

Finally in two days after requesting a quote I got a response from scrapinghub.com. Paul Tremberth wrote me that they were ready to build a scraper for $1200 and then maintain it for $300/month.

It is interesting that I have never received an answer from grepsr.com! Two years ago it was the first web scraping service we faced on the web, but now they simply ignored my request! Or perhaps they didn’t receive it somehow? Anyway I had no time for investigation.

So what?

Let us put everything together. Out of six web scraping  services I received four quotes with the following prices:

Service     Setup fee     Monthly fee

web-scraper.com     -     $250
datahen.com     $249     $65
iwebscraping.com     -     $240
scrapinghub.com     $1200     $300


From this table you can see that  scrapinghub.com appears to be the most expensive service among those compared.

EDIT: These $300/month gives you as much support and development needed to fix a 5M multi-site web crawler, for example. If you need a cheaper solution you can use their Autoscraping tool, which is free, and would have costed around $2/month to crawl at my requested rates.

The average cost of monthly scraping is about $250, but from a long term perspective datahen.com may save you money due to their low monthly fee.

That’s it! If I had enough money available it would be interesting to compare all these services in operation and provide you a more complete report, but this is all I have for now.

If you have anything to share about your experience in using similar services, please contribute to this post by commenting on it below. Cheers!

Source: http://scraping.pro/choosing-web-scraping-service/

Saturday, 8 November 2014

Web Scraping and Copyright

There are rigorous debates on the topic of copyright issues of a scraped data. Whether it is legal to use web scraping to extract data from a website or it’s just an illegal act that will lead you to a troublesome situation. There are certain implementations when we talk of web scraping. There are certain websites that will provide you with RSS feeds.

Just grab the piece use it and get the credit. However there are sites that do not allow such kind of cooperation (as we will call this!) and thus there is no other way to extract reliable data. Another way is to hold the “ctrl” plus “c” keys and wait for a while for the data to be copied to your computer.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-copyright/

Wednesday, 5 November 2014

Why Web Scraping is Indispensable

The 21st century has opened the gates to hidden treasures and unlimited access to information globally without the constraints of time and space, through Internet technology. Along with this development comes the necessity for each business or company to get as much information as possible in order in order to thrive in the ever increasing demand for new innovations, comparisons, and trends.

Web scraping has consequently become an indispensable option to achieve all the needed data as quickly and efficiently as possible. In this view, data mining then appears to be the best and the only way to answer the present demand for updates, data, coping, foreknowledge, analysis, and evaluation. Indeed, information has inevitably become a valuable commodity and the most sought after product among online and offline entrepreneurs.

Need for Data

The increasing need for new data makes it possible for the experts to become increasingly creative in accessing information worldwide. The more knowledge one has, the better are his or her chances of growing and surviving. There seems to be no other time in the human existence where data has become so much a major source of revenue as the contemporary times.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-indispensable/