Set up different types of extraction with Web Data Extractor

Keywords:

  • WDE spiders 18+ search engines to find the right websites and extracts data from them.

    Quick Start:

    Select "Search Engines" source - Enter keyword - Click OK

    What WDE Does: WDE will query 18+ popular search engines, extract all matching URLs from the search results, remove duplicate URLs and finally visit those websites and extract data from them.

    You can tell WDE how many search engines to use. Click the "Engines" button and uncheck the engines that you do not want to use. You can add other engine sources as well.

    WDE sends queries to search engines to get matching website URLs, then visits those matching websites for data extraction. How deep it spiders within the matching websites depends on the "Depth" setting on the "External Site" tab.
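The query-then-dedupe step can be sketched in Python (the engine result lists below are hypothetical examples; WDE's real engine list and query format are internal to the program):

```python
# Hypothetical result lists from two search engines for the same keyword.
results_engine_a = [
    "http://www.xyz.com/product/milk/",
    "http://www.abc.com/dairy.htm",
]
results_engine_b = [
    "http://www.abc.com/dairy.htm",   # duplicate across engines
    "http://www.example.com/shop/",
]

def merge_results(*result_lists):
    """Merge engine results, dropping duplicate URLs while keeping order."""
    seen, merged = set(), []
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

unique_urls = merge_results(results_engine_a, results_engine_b)
print(unique_urls)   # three unique URLs remain
```

The deduplicated list is what WDE then visits site by site.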

DEPTH:

  • Here you tell WDE how many levels to dig down within the specified website. If you want WDE to stay within the first page, just select "Process First Page Only". A setting of "0" will process and look for data in the whole website. A setting of "1" will process the index or home page plus the associated files under the root dir only.

    For example: WDE is going to visit the URL http://www.xyz.com/product/milk/ for data extraction.

    Let's say www.xyz.com has the following text/html pages:

    • http://www.xyz.com/
    • http://www.xyz.com/support.htm
    • http://www.xyz.com/about.htm
    • http://www.xyz.com/product/
    • http://www.xyz.com/product/support.htm
    • http://www.xyz.com/product/milk/
    • http://www.xyz.com/product/water/
    • http://www.xyz.com/product/milk/baby/
    • http://www.xyz.com/product/milk/baby/page1.htm
    • http://www.xyz.com/product/milk/baby/page2.htm
    • http://www.xyz.com/product/water/mineral/
    • http://www.xyz.com/product/water/mineral/news.htm

    WDE is a powerful, fully featured spider! You need to decide how deep you want WDE to look for data.

    WDE can retrieve:                        Set options:
    Only the matching URL page (URL #6)      Select "Process First Page Only"
    Entire milk dir (URLs #6-10)             Select "Depth=0" and check "Stay within Full URL"
    Entire www.xyz.com site                  Select "Depth=0"
    Only the www.xyz.com page                Select "Process First Page Only" and check "Spider Base URL Only"
    Only root dir files (URLs #1-3)          Select "Depth=1"
    Only URLs #1-5                           Select "Depth=2"
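The depth rule can be sketched in Python (my own approximation of the rule for the example site above, not WDE's actual code). Here depth counts directory levels below the site root, so Depth=1 keeps only files directly under the root and Depth=2 adds the first directory level:

```python
from urllib.parse import urlsplit

# The twelve example pages of www.xyz.com, in the order listed above.
urls = [
    "http://www.xyz.com/",
    "http://www.xyz.com/support.htm",
    "http://www.xyz.com/about.htm",
    "http://www.xyz.com/product/",
    "http://www.xyz.com/product/support.htm",
    "http://www.xyz.com/product/milk/",
    "http://www.xyz.com/product/water/",
    "http://www.xyz.com/product/milk/baby/",
    "http://www.xyz.com/product/milk/baby/page1.htm",
    "http://www.xyz.com/product/milk/baby/page2.htm",
    "http://www.xyz.com/product/water/mineral/",
    "http://www.xyz.com/product/water/mineral/news.htm",
]

def dir_depth(url):
    """Number of directory levels below the site root."""
    path = urlsplit(url).path
    # Drop the leading empty segment and the trailing file name (or the
    # empty segment after a trailing slash), count what remains.
    return len([seg for seg in path.split("/")[1:-1] if seg])

def within_depth(url, depth):
    """Depth=0 means the whole site; Depth=n keeps dirs shallower than n."""
    return depth == 0 or dir_depth(url) < depth

depth1 = [u for u in urls if within_depth(u, 1)]   # URLs #1-3
depth2 = [u for u in urls if within_depth(u, 2)]   # URLs #1-5
```

Under this reading, "Depth=1" selects the three root-level entries and "Depth=2" selects the first five, matching the table.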

    Stop Site on First Email Found:
    Each website is structured differently on the server. Some websites may have only a few files and some may have thousands. So sometimes you may prefer to use the "Stop Site on First Email Found" option. For example: you set WDE to spider the entire www.xyz.com site and WDE finds an email on URL #2 (support.htm). If you tell WDE to "Stop Site on First Email Found", it will not go on to the other pages (#3-12).
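The idea behind this option can be sketched as follows (a minimal illustration using fixed page contents instead of live fetches; the email pattern is a simplification):

```python
import re

# Simplified email pattern for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Hypothetical page contents for www.xyz.com (no live fetching here).
pages = {
    "http://www.xyz.com/":            "Welcome to XYZ",
    "http://www.xyz.com/support.htm": "Contact support@xyz.com for help",
    "http://www.xyz.com/about.htm":   "About us, mail info@xyz.com",
}

def extract_emails(pages, stop_on_first=True):
    """Visit pages in order; with stop_on_first, abandon the site
    as soon as one email address has been found."""
    found = []
    for url, text in pages.items():
        found.extend(EMAIL_RE.findall(text))
        if stop_on_first and found:
            break   # skip the remaining pages of this site
    return found

print(extract_emails(pages))                       # stops at support.htm
print(extract_emails(pages, stop_on_first=False))  # scans the whole site
```

With the option on, the site is abandoned after support.htm; with it off, every page is scanned.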

    Spider Base URL Only:
    With this option you can tell WDE to always process only the base URLs of external sites. For example: in the above case, if an external site such as http://www.xyz.com/product/milk/ is found, WDE will grab only the base www.xyz.com. It will not visit http://www.xyz.com/product/milk/ unless you set a depth that also covers the milk dir.
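Reducing a URL to its base can be sketched with Python's standard URL parser (an illustration of the concept, not WDE's implementation):

```python
from urllib.parse import urlsplit

def base_url(url):
    """Reduce any URL to its base: scheme://host/"""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"

print(base_url("http://www.xyz.com/product/milk/"))  # http://www.xyz.com/
```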

    Ignore Case of URLs:
    Set this option to avoid duplicate URLs like
    http://www.xyz.com/product/milk/
    http://www.xyz.com/Product/Milk/
    These 2 URLs are the same. When you set WDE to ignore URL case, it converts all URLs to lowercase and can remove duplicate URLs like those above. However, some servers are case-sensitive, and you should not use this option on those sites.
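The lowercase-and-dedupe step might look like this (a sketch of the concept; WDE's internal handling may differ):

```python
def dedupe_ignore_case(urls):
    """Lowercase URLs so variants differing only in case collapse to one."""
    seen, unique = set(), []
    for url in urls:
        key = url.lower()
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

urls = [
    "http://www.xyz.com/product/milk/",
    "http://www.xyz.com/Product/Milk/",
]
print(dedupe_ignore_case(urls))   # only one URL remains
```

On a case-sensitive server the two paths could be genuinely different pages, which is why the option should be left off there.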

WebSites:

  • Enter a website URL and extract all data found on that site

    Quick Start:

    Select 2nd option "WebSite/Dir/Groups" - Enter website URL - Select Depth - Click OK

    What WDE Does: WDE will retrieve the html/text pages of the website according to the Depth you specified and extract all data found on those pages.

    # By default, WDE will stay only on the current domain.

    # WDE can also follow external sites!

    If you want WDE to retrieve files from external sites that are linked from the starting site specified on the "General" tab, you need to set "Follow External URLs" on the "External Site" tab. In this case, by default, WDE will follow external sites only once, that is: (1) WDE will process the starting address and (2) all external sites found in the starting address. It will not follow the external sites found in (2), and so on.
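The follow-once behaviour can be modelled as a breadth-first walk with a loop limit (a sketch with a hypothetical link graph; the site names are made up for illustration):

```python
# Hypothetical link graph: which external sites each page links to.
links = {
    "http://start.example.com/": ["http://a.example.com/", "http://b.example.com/"],
    "http://a.example.com/":     ["http://c.example.com/"],
    "http://b.example.com/":     [],
    "http://c.example.com/":     [],
}

def spider(start, external_loops=1):
    """Process the start page, then follow externals for the given number
    of loops; external_loops=1 models WDE's default follow-once rule."""
    visited, frontier = [], [start]
    for _ in range(external_loops + 1):   # +1 for the start page itself
        next_frontier = []
        for url in frontier:
            if url not in visited:
                visited.append(url)
                next_frontier.extend(links.get(url, []))
        frontier = next_frontier
    return visited

print(spider("http://start.example.com/"))
# start page plus its direct externals; http://c.example.com/ is NOT visited
```

Raising the loop count (WDE's "Unlimited" setting) would keep expanding the frontier indefinitely, which is why an unlimited session must be stopped manually.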

    WDE is powerful: if you want WDE to follow external sites in an unlimited loop, select "Unlimited" in the "Spider External URLs Loop" combo box. Remember that you will need to stop the WDE session manually, because this way WDE can travel the entire internet.

Directories:

  • Choose a Yahoo, Google or other directory and get all data from there.

    Quick Start & What WDE Does:

    Let's say you want to extract data for all companies listed at
    http://directory.google.com/Top/Computers/Software/Freeware/

    Action #1A:
    Select 2nd option "WebSite/Dir/Groups" - enter this URL in "Starting Address" box - select "Process First Page Only"

    Or, let's say you want to extract data for all companies listed at
    http://directory.google.com/Top/Computers/Software/Freeware/
    plus all down-level folders like
    http://directory.google.com/Top/Computers/Software/Freeware/windows
    http://directory.google.com/Top/Computers/Software/Freeware/windows/browser
    http://directory.google.com/Top/Computers/Software/Freeware/linux
    etc....

    Action #1B:
    Select 2nd option "WebSite/Dir/Groups" - enter URL http://directory.google.com/Top/Computers/Software/Freeware/ in "Starting Address" box - select Depth=0 and "Stay within Full URL" option.

    With these actions WDE will download the http://directory.google.com/Top/Computers/Software/Freeware/ page (and optionally all down-level pages) and build a URL list of the companies listed there.
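The "Stay within Full URL" filter used in Action #1B can be modelled as a simple prefix check against the starting address (a sketch; the third candidate URL is a made-up out-of-scope example):

```python
START = "http://directory.google.com/Top/Computers/Software/Freeware/"

def stays_within(url, start=START):
    """Keep only URLs at or below the full starting address."""
    return url.startswith(start)

candidates = [
    "http://directory.google.com/Top/Computers/Software/Freeware/windows",
    "http://directory.google.com/Top/Computers/Software/Freeware/linux",
    "http://directory.google.com/Top/Computers/Hardware/",   # out of scope
]
kept = [u for u in candidates if stays_within(u)]
print(kept)   # the Hardware URL is filtered out
```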

    Now you want WDE to visit all those URLs and extract all data found in those sites.

    Action #2:
    So after either action above you must move to the "External Site" tab and check the "Follow External URLs" option. (Remember: this setting tells WDE to process/follow/visit all URLs found while processing the "Starting Address" on the "General" tab.)

List of URLs: