Class PHPCrawlerURLFilter

Description

Class for filtering URLs by given filter-rules.

Located in /libs/PHPCrawler/PHPCrawlerURLFilter.class.php (line 8)


	
			
Variable Summary
Method Summary
static void keepRedirectUrls (PHPCrawlerDocumentInfo $DocumentInfo)
void addURLFilterRule ( $regex)
void addURLFilterRules ( $regex_array)
void addURLFollowRule ( $regex)
void filterUrls (PHPCrawlerDocumentInfo $DocumentInfo)
void setBaseURL (string $starting_url)
Variables
PHPCrawlerDocumentInfo $CurrentDocumentInfo = null (line 62)

The PHPCrawlerDocumentInfo-object of the document currently being processed

  • access: protected
int $general_follow_mode = 2 (line 55)

The general follow-mode of the crawler

  • var:

    The follow-mode

    0 -> follow every link
    1 -> stay in domain
    2 -> stay in host
    3 -> stay in path

  • access: public
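
The four follow-modes can be read as progressively stricter checks against the starting URL. The following is a minimal sketch of that idea, not the library's implementation: followAllowed(), naiveDomain() and the mode names 'all', 'domain', 'host' and 'path' are hypothetical, and treating the last two host labels as the domain is a simplifying assumption that fails for suffixes such as co.uk.

```php
<?php
// Hypothetical helper: naive "domain" = last two labels of the host.
function naiveDomain(string $host): string
{
    return implode('.', array_slice(explode('.', strtolower($host)), -2));
}

// Hypothetical helper: may $candidateUrl be followed under $mode,
// relative to $startUrl? The mode names are illustrative only.
function followAllowed(string $mode, string $startUrl, string $candidateUrl): bool
{
    if ($mode === 'all') {
        return true; // follow every link
    }

    $start = parse_url($startUrl);
    $cand  = parse_url($candidateUrl);
    if (!is_array($start) || !is_array($cand) || !isset($start['host'], $cand['host'])) {
        return false; // unparsable or relative URL: rejected in this sketch
    }

    $startHost = strtolower($start['host']);
    $candHost  = strtolower($cand['host']);

    if ($mode === 'domain') {
        return naiveDomain($startHost) === naiveDomain($candHost);
    }
    if ($mode === 'host') {
        return $startHost === $candHost;
    }
    if ($mode === 'path') {
        // Directory of the starting URL, e.g. "/docs/" for "/docs/index.html".
        $startPath = $start['path'] ?? '/';
        $startDir  = substr($startPath, 0, strrpos($startPath, '/') + 1);
        return $startHost === $candHost
            && strpos($cand['path'] ?? '/', $startDir) === 0;
    }
    return false; // unknown mode
}
```

Under these assumptions, a crawl started at http://www.example.com/docs/index.html would accept http://blog.example.com/ in 'domain' mode but reject it in 'host' and 'path' mode.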
bool $obey_nofollow_tags = false (line 43)

Defines whether nofollow-tags should be obeyed.

  • access: public
string $starting_url = "" (line 15)

The fully qualified and normalized URL the crawling-process was started with.

  • access: protected
array $starting_url_parts = array() (line 22)

The URL-parts of the starting-url.

  • var: The URL-parts as returned by PHPCrawlerUtils::splitURL()
  • access: protected
array $url_filter_rules = array() (line 36)

Array containing regex-rules for URLs that should NOT be followed.

  • access: protected
array $url_follow_rules = array() (line 29)

Array containing regex-rules for URLs that should be followed.

  • access: protected
Methods
static method keepRedirectUrls (line 107)

Filters out all non-redirect-URLs from the URLs given in the PHPCrawlerDocumentInfo-object

  • access: public
static void keepRedirectUrls (PHPCrawlerDocumentInfo $DocumentInfo)
  • PHPCrawlerDocumentInfo $DocumentInfo: PHPCrawlerDocumentInfo-object containing all found links of the current document.
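
The effect of this method can be illustrated with plain arrays. This is only a sketch: keepRedirects() and the 'is_redirect' key are hypothetical stand-ins, and the way the real PHPCrawlerDocumentInfo-object stores and flags its links is not shown in this documentation.

```php
<?php
// Hypothetical stand-in: each link is an array with a 'url' and an
// 'is_redirect' flag; the real library uses objects instead.
function keepRedirects(array $links): array
{
    // Keep only entries flagged as redirects and reindex the result.
    return array_values(array_filter($links, function (array $link): bool {
        return !empty($link['is_redirect']);
    }));
}

$links = [
    ['url' => 'http://www.example.com/a.html', 'is_redirect' => false],
    ['url' => 'http://www.example.com/moved',  'is_redirect' => true],
];
$redirectsOnly = keepRedirects($links);
```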
addURLFilterRule (line 217)

Adds a rule to the list of rules that decide which URLs found on a page should be ignored by the crawler.

  • access: public
void addURLFilterRule ( $regex)
  • $regex: The rule as a regular expression.
addURLFilterRules (line 231)

Adds several rules at once to the list of rules that decide which URLs found on a page should be ignored by the crawler.

  • access: public
void addURLFilterRules ( $regex_array)
  • $regex_array: Array of rules, each given as a regular expression.
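
Because the rules are regular expressions, a malformed pattern would otherwise only surface when URLs are matched. One possible approach, sketched here with the hypothetical helpers isValidRegex() and addRules() (neither is part of the documented class), is to check each pattern when it is added.

```php
<?php
// Hypothetical helper: true if $regex compiles as a PCRE pattern.
function isValidRegex(string $regex): bool
{
    // preg_match() returns false on a malformed pattern; @ hides the warning.
    return @preg_match($regex, '') !== false;
}

// Hypothetical helper: appends the valid rules to $rules and
// returns the entries that were rejected.
function addRules(array &$rules, array $regexArray): array
{
    $rejected = [];
    foreach ($regexArray as $regex) {
        if (is_string($regex) && isValidRegex($regex)) {
            $rules[] = $regex;
        } else {
            $rejected[] = $regex;
        }
    }
    return $rejected;
}

$rules = [];
$rejected = addRules($rules, ['#\.(jpg|gif|png)$#i', '#unclosed(', '#/private/#']);
```

In this sketch the second pattern lacks its closing delimiter, so it is returned in $rejected rather than stored.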
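addURLFollowRule (line 203)

Adds a rule to the list of rules that decide which URLs found on a page should be followed by the crawler.
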
  • access: public
void addURLFollowRule ( $regex)
  • $regex: The rule as a regular expression.
filterUrls (line 82)

Filters the given URLs (contained in the given PHPCrawlerDocumentInfo-object) by the given rules.

  • access: public
void filterUrls (PHPCrawlerDocumentInfo $DocumentInfo)
  • PHPCrawlerDocumentInfo $DocumentInfo: PHPCrawlerDocumentInfo-object containing all found links of the current document.
setBaseURL (line 69)

Sets the base-URL of the crawling process, to which some of the rules relate.

  • access: public
void setBaseURL (string $starting_url)
  • string $starting_url: The URL the crawling-process was started with.
urlMatchesRules (line 125)

Checks whether a given URL matches the rules.

  • return: TRUE if the URL matches the defined rules.
  • access: protected
bool urlMatchesRules (PHPCrawlerURLDescriptor $url)
  • PHPCrawlerURLDescriptor $url: The URL as a PHPCrawlerURLDescriptor-object
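
How follow-rules and filter-rules might combine can be sketched as below. This is an assumption about the semantics rather than the class's actual code: urlPassesRules() is a hypothetical function working on plain strings, and it assumes that a URL must match at least one follow-rule (if any are defined) and must match no filter-rule.

```php
<?php
// Hypothetical helper illustrating one plausible rule combination.
function urlPassesRules(string $url, array $followRules, array $filterRules): bool
{
    // If follow-rules exist, the URL has to match at least one of them.
    if (count($followRules) > 0) {
        $matched = false;
        foreach ($followRules as $regex) {
            if (preg_match($regex, $url) === 1) {
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            return false;
        }
    }

    // Any matching filter-rule excludes the URL.
    foreach ($filterRules as $regex) {
        if (preg_match($regex, $url) === 1) {
            return false;
        }
    }

    return true;
}

$follow = ['#/docs/#'];
$filter = ['#\.(jpg|gif|png)$#i'];
$ok = urlPassesRules('http://www.example.com/docs/a.html', $follow, $filter);
```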

Documentation generated on Sun, 20 Jan 2013 21:18:50 +0200 by phpDocumentor 1.4.4