Docs For Class PHPCrawlerURLFilter

Description

Class for filtering URLs by given filter-rules.

Located in /libs/PHPCrawler/PHPCrawlerURLFilter.class.php (line 8)

Variable Summary

PHPCrawlerDocumentInfo $CurrentDocumentInfo

int $general_follow_mode

bool $obey_nofollow_tags

string $starting_url

array $starting_url_parts

array $url_filter_rules

array $url_follow_rules

Method Summary

static void keepRedirectUrls (PHPCrawlerDocumentInfo $DocumentInfo)

void addURLFilterRule ( $regex)

void addURLFilterRules ( $regex_array)

void addURLFollowRule ( $regex)

void filterUrls (PHPCrawlerDocumentInfo $DocumentInfo)

void setBaseURL (string $starting_url)

bool urlMatchesRules (PHPCrawlerURLDescriptor $url)

Variables

PHPCrawlerDocumentInfo $CurrentDocumentInfo = null (line 62)

The PHPCrawlerDocumentInfo-object of the document currently being processed

access: protected

int $general_follow_mode = 2 (line 55)

The general follow-mode of the crawler

var:
The follow-mode
0 -> follow every link
1 -> stay in domain
2 -> stay in host
3 -> stay in path
access: public
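
The four follow-modes can be illustrated with a small standalone check. This is only a sketch, not the class's actual implementation: it assumes the modes are numbered 0 to 3 as in PHPCrawl's setFollowMode() documentation, the function name followModeAllows() is hypothetical, and the "domain" is approximated naively as the last two labels of the host name.

```php
<?php
// Hypothetical helper, not part of PHPCrawl: a simplified follow-mode check.
function followModeAllows(string $url, string $startUrl, int $mode): bool
{
    $u = parse_url($url);
    $s = parse_url($startUrl);
    if ($u === false || $s === false || !isset($u['host'], $s['host'])) {
        return false;
    }
    $uHost = strtolower($u['host']);
    $sHost = strtolower($s['host']);

    // Naive "domain": the last two labels of the host name (assumption;
    // this is wrong for suffixes such as co.uk).
    $domain = function (string $host): string {
        return implode('.', array_slice(explode('.', $host), -2));
    };

    switch ($mode) {
        case 0: // follow every link
            return true;
        case 1: // stay in domain
            return $domain($uHost) === $domain($sHost);
        case 2: // stay in host
            return $uHost === $sHost;
        case 3: // stay in path
            if ($uHost !== $sHost) {
                return false;
            }
            // Directory part of the starting path, always ending in "/".
            $sPath = $s['path'] ?? '/';
            $sDir = substr($sPath, 0, strrpos($sPath, '/') + 1);
            return strncmp($u['path'] ?? '/', $sDir, strlen($sDir)) === 0;
        default:
            return false;
    }
}
```

With a starting URL of http://www.example.com/docs/index.html, a link to http://blog.example.com/x would pass in mode 1 but not in mode 2 under these assumptions.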

bool $obey_nofollow_tags = false (line 43)

Defines whether nofollow-tags should be obeyed.

access: public

string $starting_url = "" (line 15)

The fully qualified and normalized URL the crawling-process was started with.

access: protected

array $starting_url_parts = array() (line 22)

The URL-parts of the starting-url.

var: The URL-parts as returned by PHPCrawlerUtils::splitURL()
access: protected

array $url_filter_rules = array() (line 36)

Array containing regex-rules for URLs that should NOT be followed.

access: protected

array $url_follow_rules = array() (line 29)

Array containing regex-rules for URLs that should be followed.

access: protected

Methods

static method keepRedirectUrls (line 107)

Filters out all non-redirect-URLs from the URLs given in the PHPCrawlerDocumentInfo-object

access: public

static void keepRedirectUrls (PHPCrawlerDocumentInfo $DocumentInfo)

PHPCrawlerDocumentInfo $DocumentInfo: PHPCrawlerDocumentInfo-object containing all found links of the current document.
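
The effect of this method can be sketched with plain arrays. The sketch below is an illustration only: it assumes each found link carries a boolean redirect flag (called is_redirect_url here), uses associative arrays as stand-ins for the real link objects, and returns a new list rather than changing a document object. The function name keepRedirectLinks() is hypothetical.

```php
<?php
// Hypothetical stand-in for the links of a document: plain arrays
// instead of PHPCrawlerURLDescriptor-objects.
function keepRedirectLinks(array $links): array
{
    // Keep only entries whose redirect flag is set; array_values()
    // re-indexes the result so the keys are 0, 1, 2, ...
    return array_values(array_filter($links, function (array $link): bool {
        return !empty($link['is_redirect_url']);
    }));
}

$links = [
    ['url' => 'http://www.example.com/a.html', 'is_redirect_url' => false],
    ['url' => 'http://www.example.com/moved',  'is_redirect_url' => true],
];
$redirects = keepRedirectLinks($links);
```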

addURLFilterRule (line 217)

Adds a rule to the list of rules that decide which URLs found on a page should be ignored by the crawler.

access: public

void addURLFilterRule ( $regex)

$regex

addURLFilterRules (line 231)

Adds several rules at once to the list of rules that decide which URLs found on a page should be ignored by the crawler.

access: public

void addURLFilterRules ( $regex_array)

$regex_array
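
How a list of filter rules might be applied can be sketched as follows. This is a minimal illustration that assumes the rules are PCRE patterns including their delimiters and that a URL is ignored as soon as it matches any one rule; applyFilterRules() is a hypothetical standalone function, not a method of this class.

```php
<?php
// Hypothetical helper: drops every URL that matches at least one filter rule.
function applyFilterRules(array $urls, array $filterRules): array
{
    $kept = [];
    foreach ($urls as $url) {
        $ignored = false;
        foreach ($filterRules as $regex) {
            if (preg_match($regex, $url) === 1) {
                $ignored = true; // one matching rule is enough
                break;
            }
        }
        if (!$ignored) {
            $kept[] = $url;
        }
    }
    return $kept;
}

// Example rules: ignore common image files and everything under /private/.
$rules = ['#\.(jpg|gif|png)$#i', '#/private/#'];
$urls = [
    'http://www.example.com/index.html',
    'http://www.example.com/logo.PNG',
    'http://www.example.com/private/notes.html',
];
$remaining = applyFilterRules($urls, $rules);
```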

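addURLFollowRule (line 203)

Adds a rule to the list of rules that decide which URLs found on a page should be followed by the crawler.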

access: public

void addURLFollowRule ( $regex)

$regex
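
A follow rule works as a whitelist counterpart to the filter rules. The sketch below assumes that an empty rule list lets every URL pass and that otherwise a URL must match at least one rule; this semantics is an assumption for illustration, and matchesFollowRules() is a hypothetical standalone function.

```php
<?php
// Hypothetical helper: checks a URL against a list of follow rules.
function matchesFollowRules(string $url, array $followRules): bool
{
    // Assumption: without any follow rules, nothing is restricted.
    if (count($followRules) === 0) {
        return true;
    }
    foreach ($followRules as $regex) {
        if (preg_match($regex, $url) === 1) {
            return true; // one matching rule is enough
        }
    }
    return false;
}

// Example rule: only follow links to .html and .htm documents.
$followRules = ['#\.html?$#i'];
$followHtml = matchesFollowRules('http://www.example.com/page.html', $followRules);
$followPdf  = matchesFollowRules('http://www.example.com/file.pdf', $followRules);
```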

filterUrls (line 82)

Filters the given URLs (contained in the given PHPCrawlerDocumentInfo-object) by the given rules.

access: public

void filterUrls (PHPCrawlerDocumentInfo $DocumentInfo)

PHPCrawlerDocumentInfo $DocumentInfo: PHPCrawlerDocumentInfo-object containing all found links of the current document.

setBaseURL (line 69)

Sets the base-URL of the crawling process, which some of the rules are evaluated against.

access: public

void setBaseURL (string $starting_url)

string $starting_url: The URL the crawling-process was started with.

urlMatchesRules (line 125)

Checks whether a given URL matches the rules.

return: TRUE if the URL matches the defined rules.
access: protected

bool urlMatchesRules (PHPCrawlerURLDescriptor $url)

PHPCrawlerURLDescriptor $url: The URL as a PHPCrawlerURLDescriptor-object