PHPCrawl main class
Located in /libs/PHPCrawler/PHPCrawler.class.php (line 10)
Class | Description |
---|---|
SMCCrawler | Loading external PHPCrawler-class |
Number of child-process (NOT the PID!)
The PHPCrawlerCookieCache-Object
Flag cookie-handling enabled/disabled
UID of this instance of the crawler
DocumentInfoQueue-object
Limit of documents to receive
Flag indicating whether this instance is running in a child-process (if crawler runs multi-processed)
Flag indicating whether this instance is running in the parent-process (if crawler runs multi-processed)
The PHPCrawlerLinkCache-Object
Multiprocess-mode the crawler is running in.
Defines whether robots.txt-file should be obeyed
Defines if only documents that were received will be counted.
The PHPCrawlerHTTPRequest-Object
The reason why the process was aborted/finished.
ProcessCommunication-object
Flag indicating whether resumption is activated
The RobotsTxtParser-Object
The URL the crawler should start with.
The URL is fully qualified and normalized.
Limit of bytes to receive
Flag indicating whether the URL-cache was purged at the beginning of a crawling-process
The UrlFilter-Object
URL cache-type.
UserSendDataCache-object.
Base-directory for temporary directories
Complete path to the temporary directory
Initiates a new crawler.
Adds a basic-authentication (username and password) to the list of basic authentications that will be sent with requests.
Example:
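A minimal sketch, assuming $crawler is an instance of a PHPCrawler-subclass; the URL-pattern and login-data are placeholders:

```php
// Send the given login-data with every request to an URL that matches the pattern
$crawler->addBasicAuthentication("#http://www\.foo\.com/protected/#", "myusername", "mypassword");
```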
Adds a rule to the list of rules that decide which pages or files - regarding their content-type - should be received
After receiving the HTTP-header of a followed URL, the crawler checks - based on the given rules - whether the content of that URL should be received. If no rule matches with the content-type of the document, the content won't be received.
Example:
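A minimal sketch, assuming $crawler is an instance of a PHPCrawler-subclass - receive the content of HTML-documents only:

```php
// Only receive the content of documents with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");
```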
IMPORTANT: By default, if no rule was added to the list, the crawler receives every content.
Note: To reduce the traffic the crawler will cause, you should only add content-types of pages/files you really want to receive. But you should at least add the content-type "text/html" to this list, otherwise the crawler can't find any links.
Alias for addURLFollowRule().
Sets the list of html-tags from which links should be extracted from.
This method was named incorrectly in previous versions of phpcrawl. It does not ADD tags, it SETS the tags from which links should be extracted.
Example:
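A short sketch; note that this alias, like setLinkExtractionTags(), SETS the complete list (the tag-selection is illustrative):

```php
// Extract links from "href"- and "src"-attributes only
$crawler->addLinkExtractionTags(array("href", "src"));
```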
Adds a regular expression together with a priority-level to the list of rules that decide what links should be preferred.
Links/URLs that match an expression with a high priority-level will be followed before links with a lower level. All links that don't match with any of the given rules will get the level 0 (lowest level) automatically.
The level can be any positive integer.
Example:
Telling the crawler to follow links that contain the string "forum" before links that contain ".gif" before all other found links.
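A sketch of that setup (the regex-patterns are illustrative):

```php
// Follow links containing "forum" first ...
$crawler->addLinkPriority("#forum#", 10);
// ... then links containing ".gif" ...
$crawler->addLinkPriority("#\.gif#", 5);
// ... and all remaining links last (they get priority-level 0 automatically).
```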
Adds a rule to the list of rules that decide in what kinds of documents the crawler should search for links (regarding their content-type)
By default the crawler ONLY searches for links in documents of type "text/html". Use this method to add one or more other content-types the crawler should check for links.
Example:
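For instance, to let the crawler additionally search for links in stylesheets (a sketch, assuming $crawler is a PHPCrawler-instance):

```php
// Also search for links (e.g. url(...)-references) in CSS-documents
$crawler->addLinkSearchContentType("#text/css# i");
```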
Please note: It is NOT recommended to let the crawler check for links in EVERY document-type! This could slow down the crawling-process dramatically (e.g. if the crawler receives large binary-files like images and tries to find links in them).
Alias for addURLFilterRule().
Adds post-data together with an URL-rule to the list of post-data to send with requests.
Example:
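A hedged sketch; the URL-pattern and the form-fields are placeholders:

```php
// Send the given post-data with every request to an URL containing "/login"
$crawler->addPostData("#/login#", array("username" => "me", "password" => "secret"));
```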
Alias for addContentTypeReceiveRule().
Has no function anymore!
This method was redundant, please use addStreamToFileContentType(). It still exists only for compatibility-reasons.
Alias for addStreamToFileContentType().
Adds a rule to the list of rules that decide what types of content should be streamed directly to a temporary file.
If a content-type of a page or file matches with one of these rules, the content will be streamed directly into a temporary file without claiming local RAM.
It's recommended to add all content-types of files that may be of bigger size to prevent memory-overflows. By default the crawler receives all content into memory!
The content/source of pages and files that were streamed to file is not accessible directly within the overridden method handleDocumentInfo(); instead you get information about the file the content was stored in. (see properties PHPCrawlerDocumentInfo::received_to_file and PHPCrawlerDocumentInfo::content_tmp_file).
Please note that this setting doesn't affect the link-finding results; file-streams will also be checked for links.
A common setup may look like this example:
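One possibility (a sketch; the content-type patterns are illustrative):

```php
// Stream images and other potentially large binary content directly to file
$crawler->addStreamToFileContentType("#image/#");
$crawler->addStreamToFileContentType("#video/#");
$crawler->addStreamToFileContentType("#audio/#");
$crawler->addStreamToFileContentType("#application/#");
```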
Adds a rule to the list of rules that decide which URLs found on a page should be ignored by the crawler.
If the crawler finds an URL and this URL matches with one of the given regular-expressions, the crawler will ignore this URL and won't follow it.
Example:
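A sketch that keeps the crawler away from common image-files (the pattern is illustrative):

```php
// Ignore all links that point to jpg-, jpeg-, gif- or png-files
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
```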
Adds a rule to the list of rules that decide which URLs found on a page should be followed explicitly.
If the crawler finds an URL and this URL doesn't match with any of the given regular-expressions, the crawler will ignore this URL and won't follow it.
NOTE: By default and if no rule was added to this list, the crawler will NOT filter ANY URLs, every URL the crawler finds will be followed (except the ones "excluded" by other options of course).
Example:
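A sketch that restricts the crawler to one host (the pattern is illustrative):

```php
// Only follow links that lead to pages on "www.foo.com"
$crawler->addURLFollowRule("#^http://www\.foo\.com# i");
```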
Checks if the crawling-process should be aborted.
Cleans up the crawler after it has finished.
Creates the working-directory for this instance of the crawler.
Has no function anymore.
This method has no function anymore, it just still exists because of compatibility-reasons.
Enables or disables aggressive link-searching.
If this is set to FALSE, the crawler tries to find links only inside html-tags (< and >). If this is set to TRUE, the crawler tries to find links everywhere in an html-page, even outside of html-tags. The default value is TRUE.
Please note that if aggressive link-searching is enabled, the crawler may find links that are not meant as links, and it may find links in script-parts of pages that can't be rebuilt correctly - since there is no javascript-parser/interpreter implemented. (E.g. javascript-code like document.location.href= a_var + ".html").
Disabling aggressive link-searching results in a better crawling-performance.
Enables or disables cookie-handling.
If cookie-handling is set to TRUE, the crawler will handle all cookies sent by webservers just like a common browser does. The default-value is TRUE.
It's strongly recommended to set or leave the cookie-handling enabled!
Prepares the crawler for process-resumption.
In order to be able to resume an aborted/terminated crawling-process, it is necessary to initially call the enableResumption() method in your script/project.
For further details on how to resume aborted processes please see the documentation of the resume() method.
Returns the unique ID of the instance of the crawler
Returns summarizing report-information about the crawling-process after it has finished.
Returns an array with summarizing report-information after the crawling-process has finished
For detailed information on the contained array-keys see the PHPCrawlerProcessReport-class.
Starts the crawling process in single-process-mode.
Be sure to override the handleDocumentInfo()- or handlePageData()-method before calling the go()-method to process the documents the crawler finds.
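A minimal single-process setup as a sketch (the include-path and target-URL are placeholders):

```php
// Adjust the include-path to your installation
include("libs/PHPCrawler/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
  // Gets called for every received document
  function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
  {
    echo $PageInfo->url." (HTTP ".$PageInfo->http_status_code.")\n";
  }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.foo.com/");
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->go();
```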
Starts the crawler by using multi processes.
When using this method instead of the go()-method to start the crawler, phpcrawl will use the given number of processes simultaneously for spidering the target-url. Using multi processes will speed up the crawling-process dramatically in most cases.
There are some requirements though to successfully run the crawler in multi-process mode:
PHPCrawl supports two different modes of multiprocessing:
Example for starting the crawler with 5 processes using the recommended MPMODE_PARENT_EXECUTES_USERCODE-mode:
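A sketch, assuming $crawler is an instance of a PHPCrawler-subclass:

```php
// Spider the target-url with 5 processes; the user-code (handleDocumentInfo())
// gets executed within the parent-process
$crawler->goMultiProcessed(5, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);
```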
Please note that increasing the number of processes to high values doesn't automatically mean that the crawling-process will run faster! Using 3 to 5 processes should be good values to start from.
Override this method to get access to all information about a page or file the crawler found and received.
Every time the crawler found and received a document on its way this method will be called. The crawler passes all information about the currently received page or file to this method by a PHPCrawlerDocumentInfo-object.
Please see the PHPCrawlerDocumentInfo documentation for a list of all properties describing the html-document.
Example:
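A sketch of such an override (the printed properties are a small selection of what PHPCrawlerDocumentInfo provides):

```php
class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
  {
    echo "URL: ".$PageInfo->url."\n";
    echo "HTTP-status: ".$PageInfo->http_status_code."\n";
    echo "Referer: ".$PageInfo->referer_url."\n\n";
  }
}
```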
Overridable method that will be called after the header of a document was received and BEFORE the content will be received.
Every time a header of a document was received, the crawler will call this method. If this method returns any negative integer, the crawler will NOT receive the content of the particular page or file.
Example:
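A sketch, assuming the PHPCrawlerResponseHeader-object exposes the Content-Length as $header->content_length:

```php
class MyCrawler extends PHPCrawler
{
  function handleHeaderInfo(PHPCrawlerResponseHeader $header)
  {
    // Don't receive the content of documents bigger than 1 MB
    if ($header->content_length > 1000000) return -1;
  }
}
```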
Override this method to get access to all information about a page or file the crawler found and received.
Every time the crawler found and received a document on its way this method will be called. The crawler passes all information about the currently received page or file to this method by the array $page_data.
Overridable method that will be called by every used child-process just before it starts the crawling-procedure.
Every child-process of the crawler will call this method just before it starts its crawling-loop from within its process-context.
So when using the multi-process mode "PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE", this method should be overridden and used to open any needed database-connections, file-streams or other similar handles to ensure that they will get opened and accessible for every used child-process.
Example:
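A sketch; the database-credentials are placeholders:

```php
class MyCrawler extends PHPCrawler
{
  protected $db;

  function initChildProcess()
  {
    // Open a separate database-connection for every child-process
    $this->db = new mysqli("localhost", "user", "password", "mydatabase");
  }
}
```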
Initiates a crawler-process
Decides whether the crawler should obey "nofollow"-tags
If set to TRUE, the crawler will not follow links that are marked with rel="nofollow" (like <a href="page.html" rel="nofollow">) nor links from pages containing the meta-tag <meta name="robots" content="nofollow">.
By default, the crawler will NOT obey nofollow-tags.
Decides whether the crawler should parse and obey robots.txt-files.
If this is set to TRUE, the crawler looks for a robots.txt-file for every host that pages or files should be received from during the crawling-process. If a robots.txt-file for a host was found, the containing directives applying to the useragent-identification of the crawler ("PHPCrawl" or manually set by calling setUserAgentString()) will be obeyed.
The default-value is FALSE (for compatibility reasons).
Please note that the directives found in a robots.txt-file have a higher priority than other settings made by the user. If e.g. addFollowMatch("#http://foo\.com/path/file\.html#") was set, but a directive in the robots.txt-file of the host foo.com says "Disallow: /path/", the URL http://foo.com/path/file.html will be ignored by the crawler anyway.
Receives and processes the given URL
Resumes the crawling-process with the given crawler-ID
If a crawling-process was aborted (for whatever reasons), it is possible to resume it by calling the resume()-method before calling the go() or goMultiProcessed() method and passing the crawler-ID of the aborted process to it (as returned by getCrawlerId()).
In order to be able to resume a process, it is necessary that it was initially started with resumption enabled (by calling the enableResumption() method).
This method throws an exception if resuming of a crawling-process failed.
Example of a resumable crawler-script:
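A sketch of such a script; the location used for storing the crawler-ID ("/tmp/mycrawlerid.tmp") is a placeholder:

```php
$crawler = new MyCrawler();
$crawler->enableResumption();

// On the first start: store the crawler-ID (in a tmp-file in this sketch)
if (!file_exists("/tmp/mycrawlerid.tmp"))
{
  file_put_contents("/tmp/mycrawlerid.tmp", $crawler->getCrawlerId());
}
// On a restart after an abort: read the stored ID and resume the process
else
{
  $crawler->resume(file_get_contents("/tmp/mycrawlerid.tmp"));
}

$crawler->setURL("http://www.foo.com/");
$crawler->go();
```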
Alias for enableAggressiveLinkSearch()
Sets the timeout in seconds for connection tries to hosting webservers.
If the connection to a host can't be established within the given time, the request will be aborted.
Sets the content-size-limit for content the crawler should receive from documents.
If the crawler is receiving the content of a page or file and the content-size-limit is reached, the crawler stops receiving content from this page or file.
Please note that the crawler can only find links in the received portion of a document.
The default-value is 0 (no limit).
Alias for enableCookieHandling()
Sets the basic follow-mode of the crawler.
The following list explains the supported follow-modes:
0 - The crawler will follow EVERY link, even if the link leads to a different host or domain. If you choose this mode, you really should set a limit to the crawling-process (see limit-options), otherwise the crawler may crawl the whole WWW!
1 - The crawler only follows links that lead to the same domain as the one in the root-url. E.g. if the root-url (setURL()) is "http://www.foo.com", the crawler will follow links to "http://www.foo.com/..." and "http://bar.foo.com/...", but not to "http://www.another-domain.com/...".
2 - The crawler will only follow links that lead to the same host as the one in the root-url. E.g. if the root-url (setURL()) is "http://www.foo.com", the crawler will ONLY follow links to "http://www.foo.com/...", but not to "http://bar.foo.com/..." and "http://www.another-domain.com/...". This is the default mode.
3 - The crawler only follows links to pages or files located in or under the same path as the one of the root-url. E.g. if the root-url is "http://www.foo.com/bar/index.html", the crawler will follow links to "http://www.foo.com/bar/page.html" and "http://www.foo.com/bar/path/index.html", but not links to "http://www.foo.com/page.html".
Defines whether the crawler should follow redirects sent with headers by a webserver or not.
Defines whether the crawler should follow HTTP-redirects until first content was found, regardless of defined filter-rules and follow-modes.
Sometimes, when requesting an URL, the first thing the webserver does is sending a redirect to another location, and sometimes the server of this new location is sending a redirect again (and so on). So in the end it's possible that you find the expected content on a totally different host than expected.
If you set this option to TRUE, the crawler will follow all these redirects until it finds some content. If content finally was found, the root-url of the crawling-process will be set to this url and all defined options (follow-mode, filter-rules etc.) will relate to it from now on.
Sets the list of html-tags the crawler should search for links in.
By default the crawler searches for links in the following html-tags: href, src, url, location, codebase, background, data, profile, action and open. As soon as the list is set manually, this default list will be overwritten completely.
Example:
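A sketch (this is also the kind of reduced list the note below refers to):

```php
// Only extract links from "href"- and "src"-attributes
$crawler->setLinkExtractionTags(array("href", "src"));
```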
Note: Reducing the number of tags in this list will improve the crawling-performance (a little).
Sets a limit to the number of pages/files the crawler should follow.
If the limit is reached, the crawler stops the crawling-process. The default-value is 0 (no limit).
Sets the port to connect to for crawling the starting-url set in setUrl().
The default port is 80.
Note: Setting the port via setPort(4500) after calling setUrl("http://www.foo.com") has the same effect as setting the root-url directly to "http://www.foo.com:4500".
Assigns a proxy-server the crawler should use for all HTTP-requests.
Sets the timeout in seconds for waiting for data on an established server-connection.
If the connection to a server was established but the server doesn't send data anymore without closing the connection, the crawler will wait for the given time and then close the connection.
Has no function anymore.
Please use setWorkingDirectory() instead.
Sets a limit to the number of bytes the crawler should receive altogether during the crawling-process.
If the limit is reached, the crawler stops the crawling-process. The default-value is 0 (no limit).
Sets the URL of the first page the crawler should crawl (root-page).
The given url may contain the protocol (http://www.foo.com or https://www.foo.com), the port (http://www.foo.com:4500/index.php) and/or basic-authentication-data (http://loginname:passwd@www.foo.com)
This url has to be set before calling the go()-method (of course)! If this root-page doesn't contain any further links, the crawling-process will stop immediately.
Defines what type of cache will be internally used for caching URLs.
Currently phpcrawl is able to use an in-memory-cache or a SQLite-database-cache for caching/storing found URLs internally.
The memory-cache (PHPCrawlerUrlCacheTypes::URLCACHE_MEMORY) is recommended for spidering small to medium websites. It provides better performance, but the php-memory-limit may be hit when too many URLs get added to the cache. This is the default-setting.
The SQLite-cache (PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE) is recommended for spidering huge websites. URLs get cached in a SQLite-database-file, so the cache is only limited by available harddisk-space. To increase performance of the SQLite-cache you may set its location to a shared-memory device like "/dev/shm/" by using the setWorkingDirectory()-method.
Example:
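A sketch, assuming $crawler is an instance of a PHPCrawler-subclass:

```php
// Use the SQLite-cache for spidering a huge website
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
```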
NOTE: When using phpcrawl in multi-process-mode (goMultiProcessed()), the cache-type is automatically set to PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE.
1 -> in-memory-cache (default setting)
2 -> SQLite-database-cache
Or one of the PHPCrawlerUrlCacheTypes::URLCACHE..-constants.
Sets the "User-Agent" identification-string that will be send with HTTP-requests.
Sets the working-directory the crawler should use for storing temporary data.
Every instance of the crawler needs and creates a temporary directory for storing some internal data.
This setting defines which base-directory the crawler will use to store the temporary directories in. By default, the crawler uses the system's temp-directory as working-directory. (e.g. "/tmp/" on linux-systems)
All temporary directories created in the working-directory will be deleted automatically after a crawling-process has finished.
NOTE: To speed up the performance of a crawling-process (especially when using the SQLite-urlcache), try to set a mounted shared-memory device as working-directory (e.g. "/dev/shm/" on Debian/Ubuntu-systems).
Example:
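A sketch:

```php
// Use "/tmp/" as the base-directory for temporary data
$crawler->setWorkingDirectory("/tmp/");
```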
Starts the loop of the controller-process (main-process).
Starts the loop of a child-process.
Documentation generated on Sun, 20 Jan 2013 21:18:49 +0200 by phpDocumentor 1.4.4