Web Data Extractor - Extract URL, Meta Tag, Email, Phone, Fax from Web

Glossary

1. WDE Pro Toolbar

The Web Extractor toolbar provides one-click access to the application most frequently used Main menu items as well as provides you with the statistic information about current extraction session.

New session

A dialog with various session settings (such as search engines, retrieval depth, etc.)

Edit session

Enables to modify parameters of a session that has not been started. The button is disabled when the session is in progress. If you want to edit a session that has been started, press Pause/Stop, then press Edit Session button and finally, in Session dialog press New Session.

Start

Run specified (current or stored) session

Pause

Interrupt or resume session running

Stop

Interrupt session running (enables to create new session)

Diagram

Representation of the current extraction speed graphically and through data rates measurement (in bits per second)

Options

General and more specific (connection, search engines, parser) settings. Once specified, they will be applied to all following sessions.

General

Temp path

The path of program temporary folder

Output path

The path of Results folder

Encoding

Type of Results encoding

Connection

Performance

WDE is a multithreaded spider, that is it can simultaneously visit different web pages. You can set the number of threads on the requester, downloader and parser separately. This controls the number of separate threads or connections that program will use to extract data for the project. Remember, that the more threads you choose the more you load your computer, Internet connection and host server which may actually slow the process down.

Human factor

Inserting a random delay between queries. This sometimes makes a website think that it is visited by a person, not a spider. So this site does not apply any preventive measures.

User agent

The identity of WDE when communicating with remote servers

Proxy

Proxy option gives you an opportunity to compose and upload proxy servers list and extract more information with it.

Search engines

You can choose search engine(s) of 64 different countries. Settings of this dialog will be saved for all future sessions until you change them. Unlike settings entered in New Session dialog → Data Source → Search Engines → Select Search Engines dialog, that will be valid only for one session.

Parser

Phone/fax parser options

Enter phone/fax prefix for WDE to distinguish corresponding data type

Email prefix

Enter email prefix for WDE to distinguish corresponding data type

Meta tags extraction

Set the number of characters that Title, Description and Keywords may contain (displayed in Results). You can also choose size unit (to be displayed in Process Log, Results) and date format (to be displayed in Session Name, Results and Stored Sessions).

Valid domains

Check this option so that all extracted data is verified against domain list. By default it is always checked. To add new or additional domain open "DomainList.txt" file in program installation folder and edit it using NotePad or other text editor.

About

Learn more about WDE Pro in Help, download new tools for your program or report us about a problem

Process log

Logging form containing information on session running process

Results

Various information on data types (URLs, domains, meta tags, emails, phones, faxes) you've chosen to extract. In this mode Save results button in upper right corner is active.

Bad URLs

URLs that WDE had problems visiting. In this mode Save Bad URLs button in upper right corner is active.

Stored sessions

A list of session (and groups of sessions) names with start/pause/stop information. Use Filter for quick search through session (group) names. In this mode Delete Stored Sessions button in upper right corner is active. To start selected stored session press Start button. To create a new session based on the settings of one of the stored sessions – double-click on one of the stored sessions and click New Session in the opened dialog. To view Results of a stored session – double-click one and then close the emersed dialog.

Save

This button is active only for Results and Bad URLs mode. A dialog that opens will help you to customize saving process. WDE Pro saves data in comma separated value format (CSV). CSV file can be exported to Excel, OpenOffice.org Calc, Apple Numbers, Google Docs & Spreadsheets, many databases, etc.

2. Session

A dialog with various session settings (such as search engines, retrieval depth, etc.).

Note: Before launching new session, always make sure that it doesn't contain settings from the previous project (filters, custom data, etc.).

General

Session name

By default consists of session creation time and website URL/keywords put in the entry field

Group

Enter the name of the group and all following sessions will belong to the set group. This will organize collected session data. Use Filter to simplify the output of required session information in Stored Sessions.

Timeline

Indicates timeline of the Stored Session

New session

Current session status. When one of the Stored Sessions is activated, this status is changed to: Started (dd.mm.yy hh:mm:ss), Paused (dd.mm.yy hh:mm:ss), and, if session was finished – Finished (dd.mm.yy hh:mm:ss).

Data source

Site

Information extraction from a specified website. In this mode WDE will retrieve html/text pages of the website according to the depth you specifiy and extract all data found in those pages.

Start URL

Enter an URL that you want WDE to start searching from: a domain name or specific domain sub-directory

Depth

Distance between a directory containing Start URL and a directory that we want to be spidered last. The index of depth tells WDE how many levels to dig down within the specified domain. By default maximal depth (10) is selected. Depth option is customizable only in Site and URL List section. It does not work in Search Engines section.

Example:

Negative depth:

Depth = -10 (maximum): WDE looks for data 10 levels up from the start level

Depth = -3: WDE goes one more level up and spiders all subdirectories found in all directories at depth 1, -2

Depth = -2: WDE goes one level up from the start level and spiders all subdirectories of start level directories

Depth = -1: same as depth 1

Depth = 0:WDE visits ONLY pages of the Start URL directory. It does not search in other folders within start folder. Start URL is always at a depth 0.

Positive depth:

Depth = 1: WDE stays on the start level and spiders ALL directories within this level

Depth = 2: WDE goes one level down from the start level and spiders all subdirectories of start level directories

Depth = 3: WDE goes one more level down and spiders all subdirectories found in all directories at depth 1, 2

Depth = 10 (maximum): WDE looks for data 10 levels down from the start level

Depth = -4: http://www.xyz.com/index.htm

Depth = -3: http://www.xyz.com/W/product.htm

Depth = -2: http://www.xyz.com/W/D/download.htm

Depth = 0: http://www.xyz.com/W/D/E/purchase.htm

Depth = 1: http://www.xyz.com/W/D/E1/modules.htm

Depth = 2: http://www.xyz.com/W/D/E/P/help.htm

Depth = 3: http://www.xyz.com/W/D/E/P/R/faq.htm

Depth = 4: http://www.xyz.com/W/D/E/P/R/O/support.htm

To spider (process, visit) a directory (folder) means to spider every web page (file) in it. To reach specified depth WDE spiders a site consistently undergoing all previous layers.

One page only

WDE stays only within the default page of specified start URL

Offsite

WDE can retrieve files of external sites that are linked to the site specified in Start URL box. By default, WDE will follow external sites only once, that is (1) WDE will process starting address and (2) all external sites found in starting address. It will not follow all external sites found in (2) and so on. In the Offsite mode the program spiders both start URL and all offsite URLs on maximal depth. WDE is powerful, so remember to stop WDE sesssion manually, because this way WDE can travel entire Internet.

Positive

Search in lower levels comparing with the Start level

Negative

Search in upper levels comparing with the Start level

Example:

WDE is going to visit URL http://www.xyz.com/product/milk/ for data extraction. The structure of this process is as follows:

URL	URL Depth	Selected Depth
URL	URL Depth	<=-3	-2	-1	0	1	>=2
http://www.xyz.com/	-3	x
http://www.xyz.com/support.htm	-3	x
http://www.xyz.com/about.htm	-3	x
http://www.xyz.com/product/	-2	x	x
http://www.xyz.com/product/support.htm	-2	x	x
http://www.xyz.com/product/milk/index.htm	One Page only	x	x	x	x	x	x
http://www.xyz.com/product/milk/	0	x	x	x	x	x	x
http://www.xyz.com/product/milk/baby/	2						x
http://www.xyz.com/product/milk/baby/page1.htm	2						x
http://www.xyz.com/product/milk/baby/page2.htm	2						x
http://www.xyz.com/product/water/	1					x	x
http://www.xyz.com/product/water/mineral/	2						x
http://www.xyz.com/product/water/mineral/news.htm	2						x

Search engines

Information extraction from all specified search engines. In this mode WDE sends queries to specified search engines to get matching website URLs from search results. Next it removes duplicate URLs. Finally, it visits those URLs to extract data from them.

Keyword

A word or phrase (that together with other words or phrases) relevantly represent a topic

Select search engines

You can choose search engine(s) of 64 different countries. Settings of this dialog will be valid only for one session. Unlike settings entered in Options → Search Engines dialog, that will be saved for all future sessions until you change them.

URL list

Information extraction from a text file containing a list of URLs. In this mode WDE will visit websites from the list of URLs one-by-one according to the depth you specify. WDE accepts URLs that point to root directory as well as URLs pointing to specific html file.

File

Enter the file name that contains all URL links to process or browse it from your computer

Browse

Choose the file containing list of URLs. This must be plain text file with one URL per line, each line starting with http://

Open

Click this button to look through the browsed file

URL filter

Set of URL adjustments to customize data extraction.

Note: Most of the sites are case-insensitive. So by default WDE ignores case of URLs, which means it regards two following URLs as same:

http://www.abc.com

http://WWW.ABC.COM

Page must contain keywords

Set this option if you want to specify a list of keywords that an URL must contain before it is downloaded. You can enter one or more keywords line by line.

Page must not contain keywords

Set this option if you want to specify a list of keywords that an URL must not contain before it is downloaded. You can enter one or more keywords line by line.

Example:

You do not want to process files from folder "http://www.xyz.com/movies", so enter “/movies” in the exclude box so WDE will not download anything from that folder. Slash “/” is somewhat needed to make sure you filter a folder and not a file (movies.html).

Page must contain following text to extract data

Use text filter to extract data only from those web pages that contain text you specify. You can specify one or more keywords in the box and set “Match any”/”Match all” text.

Example:

If ”Match any” is set then WDE will accept a testing webpage that contains any of your keywords. If ”Match all” is set then a testing webpage must contain all of your keywords for WDE to start data extraction from that page.

Fixed number of pages

WDE will only go through set amount of pages

Authentication

First page authentication

If WDE comes across some website where registration is needed, it will ask you for your user login and password to enter the website

Extraction data

Custom data

Enables to enter Custom Data Editor

URLs, Domains, Meta Tags, Emails, Phones, Faxes

Types of information that can be extracted

Email filter

Email filter is used to filter emails during data extraction

No duplicate domains

Check this option if you need only one email address from each domain represented in the filtered list. This is really useful when the list contains emails in corporate domains. Using this option you avoid mailing one message to the same company repeatedly. But at the same time you keep only one email address per a web-based service (domains Yahoo, Hotmail, MSN, etc.), where in fact each address belongs to a different person. To solve this problem there is a sub-option Except Mail Services.

Example:

The input list is:

mary@company.com
alex@magazine.com
snail@yahoo.com
smith@market.com
jane@market.com
twiggy@yahoo.com
info@company.com
job@magazine.com
nicky@yahoo.com

f the option No Duplicate Domains is checked, the result will be as follows:

alex@magazine.com
mary@company.com
smith@market.com
snail@yahoo.com

Except mail services

Check this option together with No Duplicate Domains to keep in your list all email addresses which belong to domains that you list in the box below. In any other domain still only one email address will be kept.

Example:

The input list is the same:

mary@company.com
alex@magazine.com
snail@yahoo.com
smith@market.com
jane@market.com
twiggy@yahoo.com
info@company.com
job@magazine.com
nicky@yahoo.com

If the options No Duplicate Domains and Except Mail Services are checked (and the Yahoo domain is indicated in the box below), the result will be as follows:

alex@magazine.com
mary@company.com
nicky@yahoo.com
smith@market.com
snail@yahoo.com
twiggy@yahoo.com

Preference (keep it duplicate)

This option allows you to specify more details in case duplicates occur within a domain.

For example, WDE came across two emails:info@company.com;

contact@company.com. If you activate this option and specify the word “contact” in the box below, then

the first email will be rejected (that is, placed to the tab Skipped) and the second one will be accepted (placed to the tab Filtered).

No duplicate URL domains

While the option Duplicate domains works with the domains that follow the '@' sign in email addresses, the option No duplicate URL domains is responsible for taking only one email address from each processed URL (web page).

Each email must match URL domain

WDE will look only for such emails whose domains match their own URL domains. Like abc@xyz.com and www.xyz.com.

Remove addresses in these domains

Check this option if you need to skip all email addresses from the specified domains. List these domains in the box below, one domain per line, starting with a dot (.org).

Collect domains without emails

Not all websites contain emails, even though they satisfy search criteria. That's when this option is useful – it lists these websites. Then on this base another program (ListMotor http://www.listmotor.biz/) will help you generate valid email list from the standard list of usernames (info, admin, support). This option is available only for Search Engines and URL List mode.

Example:

There is a website: abc.com, that doesn't contain any explicit emails. But it likely contains some service email (like admin@abc.com). With the help of WDE Pro and ListMotor you will get this email.

Remove address if it contains words

Use this option if you want to exclude email addresses containing specific words.

List these words in the box under the option, one word per line.

For example, if you put the word “bad” on the list, the following emails will be skipped:

badminton@sports.org, info@baden-baden.de, etc.

You can use Spam button to fill in the edit box of well-known spam sources.

Once edited, this setting remains valid for all further sessions until you change it.

Please note: you MUST NOT use WDE for spam purpose. We strongly discourage spam.

Include address only if it contains words

Check this option if you need to keep only email addresses that contain specific words. List these words in the box below, one word per line.

For example, if you put the word “nice” on the list, the following emails will be kept: city-of-nice@france.fr, info@venice.it, etc.

Include address only if page title contains words

Check this option if you need to keep only email addresses extracted from web pages where titles contain specific words. This words appear in the header of your Internet browser window if you visit the page. List these words in the box under the option, one word per line.

Custom data expressions

Custom data editor

This tool is developed to enable you extract items of information that have common visual structure from specified website. They are used to be called expressions or patterns. For example, to form a list of products of specified online store.

WDE finds information with the use of custom data expressions. To create such an expression correctly is a complex task demanding special knowledge. Custom Data Editor enables you to form and edit expressions automatically. This dialog can be accessed only in Data Source → Site mode. This dialog can be also accessed by hovering a cursor over the triangle button of Data1 (or other) tab.

Data1

Adjustment suit that sets the search mode of information container. You can view, add, remove or edit data piece(s) you created (25 data pieces maximum). To switch between data tabs you can use left-right, up-down keyboard buttons or ctrl+tab.

Begins with; Contains; Ends with

These boxes contain fragments of an expression that is built based on these fragments. When creating or editing expressions you can use “*” instead of any amount of any symbols and “|” to denote an alternative.

Example:

The complete expression:(?s)(?i)(?<=ttl">)(.*?< img itemprop="image" src=.*?)(?=

Contains:img itemprop="image" src=

Ends with: </a> </d

Help

Leads you to http://www.webextractor.com/help.htm

Start

Launch specified session

Close

Close session dialog

3. Custom data editor

This dialog can be also accessed by hovering a cursor over the triangle button of Data1 (or other) tab.

Match case

Case sensitive search of information containers (not default)

Match new line

Enables search of one expression in several lines (not default)

Begins with; Contains; Ends with

Example:

The complete expression: (?s)(?i)(?<=ttl">)(.*?


                  Begins with: ttl"> 
                  Contains:img itemprop="image" src=
                  Ends with:</a> </d

Test page URL

Enter an URL you want to work with and press Load

Page content box

Contains extracted HTML code of specified web page

Word wrap

No long strings scrolling

Visual expression builder

A dialog that helps you to view and navigate the web page to build expressions

Help

Leads you to http://www.webextractor.com/help.htm

OK

Apply expressions to specified session

Close

Close Custom Data Editor dialog

4. Visual expression builder

A dialog that helps you to view and navigate the web page to build expressions

Navigation

Use the address line as in your common browser. You can follow a new URL (after you copied it to the clipboard) just by holding the Refresh button for a moment.

Selection

Activates expression building mode by locking page navigation and disabling selection of active content. In this mode all links are blocked by default. Uncheck to navigate website like in your common browser with Java and frames enabled. When processing similar containers, here will be displayed a number of items found.

The whole Selection section menu can be displayed by the right click on page content.

Lock current page

Enables a one-time link following returning to expression building mode right after that.

Information Container Backlight

This option helps to distinguish information container you have selected. To select/deselect the backlighted container use double click.

Precise selection

Is helpful when you need to distinguish a container (or a group of containers) from a more general container. When done, click 1^st Item to Select.

Remove all unnecessary

Simplification of page display in case when there are problems selecting needed information

1^st /2^nd item to select

Determination of the minimal container for selected information unit.

Find all alike

An order for Visual Expression Builder to find all similar containers according to the search rule specified in an expression

Deselect

Deselect specified containers

Show more

Setting less strict rules of expression building. As a result, more similar containers with a common expression will be found.

Show less

Setting more strict rules of expression building. As a result, less similar containers with a common expression will be found.

Optimize an expression and apply

Reduce an expression to minimal value in such a way that units of information found will satay unchanged and apply it to Custom Data Editor. This helps WDE to work significantly faster.

Apply an expression without optimization

Apply an expression to Custom Data Editor without reducing it to minimal value

Flashing backlight of selected items

Helps to distinguish selected containers by flashing backlight

Show log

Opens Visual Expression Builder log file

Settings

Contains backlight, tool tips and history settings

Help

Leads you to http://www.webextractor.com/help.htm

In the bottom corners there is information on current process status.

Web Data Extractor Automatically extract specific information from web pages

Glossary

1. WDE Pro Toolbar

New session

Edit session

Start

Pause

Stop

Diagram

Options

General

Temp path

Output path

Encoding

Performance

Human factor

User agent

Search engines

Phone/fax parser options

Email prefix

Meta tags extraction

Valid domains

About

Process log

Results

Bad URLs

Stored sessions

Save

General

Session name

Group

Timeline

New session

Data source

Start URL

Depth

Negative depth:

Positive depth:

One page only

Positive

Negative

Keyword

Select search engines

URL list

File

Browse

Open

Page must contain keywords

Page must not contain keywords

Page must contain following text to extract data

Fixed number of pages

Authentication

First page authentication

Extraction data

Custom data

URLs, Domains, Meta Tags, Emails, Phones, Faxes

No duplicate domains

Except mail services

Preference (keep it duplicate)

No duplicate URL domains

Each email must match URL domain

Remove addresses in these domains

Collect domains without emails

Remove address if it contains words

Include address only if it contains words

Include address only if page title contains words

Custom data editor

Data1

Begins with; Contains; Ends with

Help

Start

Close

3. Custom data editor

Match case

Match new line

Begins with; Contains; Ends with

Test page URL

Page content box

Word wrap

Help

Web Data Extractor
Automatically extract specific information from web pages

1^st /2^nd item to select