PHPCrawl main class
Located in /libs/PHPCrawler/PHPCrawler.class.php (line 10)
Class | Description |
---|---|
SMCCrawler | Loading external PHPCrawler-class |
Number of child-process (NOT the PID!)
The PHPCrawlerCookieCache-Object
Flag cookie-handling enabled/disabled
UID of this instance of the crawler
DocumentInfoQueue-object
Limit of documents to receive
Flag indicating whether this instance is running in a child-process (if crawler runs multi-processed)
Flag indicating whether this instance is running in the parent-process (if crawler runs multi-processed)
The PHPCrawlerLinkCache-Object
Multiprocess-mode the crawler is running in.
Defines whether robots.txt-file should be obeyed
Defines if only documents that were received will be counted.
The PHPCrawlerHTTPRequest-Object
The reason why the process was aborted/finished.
ProcessCommunication-object
Flag indicating whether resumption is activated
The RobotsTxtParser-Object
The URL the crawler should start with.
The URL is fully qualified and normalized.
Limit of bytes to receive
Flag indicating whether the URL-cache was purged at the beginning of a crawling-process
The UrlFilter-Object
URL cache-type.
UserSendDataCache-object.
Base-directory for temporary directories
Complete path to the temporary directory
Initiates a new crawler.
Adds a basic-authentication (username and password) to the list of basic authentications that will be sent with requests.
Example:
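A minimal sketch, assuming $crawler is an instance of a PHPCrawler-subclass; the URL-pattern and login-data are placeholders:

```php
// Send the given login-data with every request to an URL that matches the pattern
$crawler->addBasicAuthentication("#http://www\.foo\.com/protected/#", "myusername", "mypassword");
```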
Adds a rule to the list of rules that decide which pages or files - regarding their content-type - should be received
After receiving the HTTP-header of a followed URL, the crawler checks - based on the given rules - whether the content of that URL should be received. If no rule matches with the content-type of the document, the content won't be received.
Example:
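A minimal sketch, assuming $crawler is an instance of a PHPCrawler-subclass - receive the content of HTML-documents only:

```php
// Only receive the content of documents with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");
```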
IMPORTANT: By default, if no rule was added to the list, the crawler receives every content.
Note: To reduce the traffic the crawler will cause, you should only add content-types of pages/files you really want to receive. But you should at least add the content-type "text/html" to this list, otherwise the crawler can't find any links.
Alias for addURLFollowRule().
Sets the list of html-tags from which links should be extracted from.
This method was named incorrectly in previous versions of phpcrawl. It does not ADD tags, it SETS the tags from which links should be extracted.
Example:
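A short sketch; note that this alias, like setLinkExtractionTags(), SETS the complete list (the tag-selection is illustrative):

```php
// Extract links from "href"- and "src"-attributes only
$crawler->addLinkExtractionTags(array("href", "src"));
```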
Adds a regular expression together with a priority-level to the list of rules that decide what links should be preferred.
Links/URLs that match an expression with a high priority-level will be followed before links with a lower level. All links that don't match with any of the given rules will get the level 0 (lowest level) automatically.
The level can be any positive integer.
Example:
Telling the crawler to follow links that contain the string "forum" before links that contain ".gif" before all other found links.
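A sketch of that setup (the regex-patterns are illustrative):

```php
// Follow links containing "forum" first ...
$crawler->addLinkPriority("#forum#", 10);
// ... then links containing ".gif" ...
$crawler->addLinkPriority("#\.gif#", 5);
// ... and all remaining links last (they get priority-level 0 automatically).
```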
Adds a rule to the list of rules that decide in what kinds of documents the crawler should search for links (regarding their content-type)
By default the crawler ONLY searches for links in documents of type "text/html". Use this method to add one or more other content-types the crawler should check for links.
Example:
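For instance, to let the crawler additionally search for links in stylesheets (a sketch, assuming $crawler is a PHPCrawler-instance):

```php
// Also search for links (e.g. url(...)-references) in CSS-documents
$crawler->addLinkSearchContentType("#text/css# i");
```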
Please note: It is NOT recommended to let the crawler check for links in EVERY document-type! This could slow down the crawling-process dramatically (e.g. if the crawler receives large binary-files like images and tries to find links in them).
Alias for addURLFilterRule().
Adds post-data together with an URL-rule to the list of post-data to send with requests.
Example:
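A hedged sketch; the URL-pattern and the form-fields are placeholders:

```php
// Send the given post-data with every request to an URL containing "/login"
$crawler->addPostData("#/login#", array("username" => "me", "password" => "secret"));
```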
Alias for addContentTypeReceiveRule().
Has no function anymore!
This method was redundant, please use addStreamToFileContentType(). It still exists only for compatibility-reasons.
Alias for addStreamToFileContentType().
Adds a rule to the list of rules that decide what types of content should be streamed directly to a temporary file.
If a content-type of a page or file matches with one of these rules, the content will be streamed directly into a temporary file without claiming local RAM.
It's recommended to add all content-types of files that may be of bigger size to prevent memory-overflows. By default the crawler receives all content into memory!
The content/source of pages and files that were streamed to file is not accessible directly within the overridden method handleDocumentInfo(); instead you get information about the file the content was stored in. (see properties PHPCrawlerDocumentInfo::received_to_file and PHPCrawlerDocumentInfo::content_tmp_file).
Please note that this setting doesn't affect the link-finding results; file-streams will also be checked for links.
A common setup may look like this example:
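One possibility (a sketch; the content-type patterns are illustrative):

```php
// Stream images and other potentially large binary content directly to file
$crawler->addStreamToFileContentType("#image/#");
$crawler->addStreamToFileContentType("#video/#");
$crawler->addStreamToFileContentType("#audio/#");
$crawler->addStreamToFileContentType("#application/#");
```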
Adds a rule to the list of rules that decide which URLs found on a page should be ignored by the crawler.
If the crawler finds an URL and this URL matches with one of the given regular-expressions, the crawler will ignore this URL and won't follow it.
Example:
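A sketch that keeps the crawler away from common image-files (the pattern is illustrative):

```php
// Ignore all links that point to jpg-, jpeg-, gif- or png-files
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
```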
Adds a rule to the list of rules that decide which URLs found on a page should be followed explicitly.
If the crawler finds an URL and this URL doesn't match with any of the given regular-expressions, the crawler will ignore this URL and won't follow it.
NOTE: By default and if no rule was added to this list, the crawler will NOT filter ANY URLs, every URL the crawler finds will be followed (except the ones "excluded" by other options of course).
Example:
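A sketch that restricts the crawler to one host (the pattern is illustrative):

```php
// Only follow links that lead to pages on "www.foo.com"
$crawler->addURLFollowRule("#^http://www\.foo\.com# i");
```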
Checks if the crawling-process should be aborted.
Cleans up the crawler after it has finished.
Creates the working-directory for this instance of the crawler.
Has no function anymore.
This method has no function anymore, it just still exists because of compatibility-reasons.
Enables or disables aggressive link-searching.
If this is set to FALSE, the crawler tries to find links only inside html-tags (< and >). If this is set to TRUE, the crawler tries to find links everywhere in an html-page, even outside of html-tags. The default value is TRUE.
Please note that if aggressive link-searching is enabled, the crawler may find links that are not meant as links, and it may find links in script-parts of pages that can't be rebuilt correctly - since there is no javascript-parser/interpreter implemented. (E.g. javascript-code like document.location.href= a_var + ".html").
Disabling aggressive link-searching results in a better crawling-performance.
Enables or disables cookie-handling.
If cookie-handling is set to TRUE, the crawler will handle all cookies sent by webservers just like a common browser does. The default-value is TRUE.
It's strongly recommended to set or leave the cookie-handling enabled!
Prepares the crawler for process-resumption.
In order to be able to resume an aborted/terminated crawling-process, it is necessary to initially call the enableResumption() method in your script/project.
For further details on how to resume aborted processes please see the documentation of the resume() method.
Returns the unique ID of the instance of the crawler
Returns summarizing report-information about the crawling-process after it has finished.
Returns an array with summarizing report-information after the crawling-process has finished
For detailed information on the contained array-keys see the PHPCrawlerProcessReport-class.
Starts the crawling process in single-process-mode.
Be sure to override the handleDocumentInfo()- or handlePageData()-method before calling the go()-method to process the documents the crawler finds.
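A minimal single-process setup as a sketch (the include-path and target-URL are placeholders):

```php
// Adjust the include-path to your installation
include("libs/PHPCrawler/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
  // Gets called for every received document
  function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
  {
    echo $PageInfo->url." (HTTP ".$PageInfo->http_status_code.")\n";
  }
}

$crawler = new MyCrawler();
$crawler->setURL("http://www.foo.com/");
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->go();
```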
Starts the crawler by using multi processes.
When using this method instead of the go()-method to start the crawler, phpcrawl will use the given number of processes simultaneously for spidering the target-url. Using multi processes will speed up the crawling-process dramatically in most cases.
There are some requirements though to successfully run the crawler in multi-process mode:
PHPCrawl supports two different modes of multiprocessing:
Example for starting the crawler with 5 processes using the recommended MPMODE_PARENT_EXECUTES_USERCODE-mode:
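A sketch, assuming $crawler is an instance of a PHPCrawler-subclass:

```php
// Spider the target-url with 5 processes; the user-code (handleDocumentInfo())
// gets executed within the parent-process
$crawler->goMultiProcessed(5, PHPCrawlerMultiProcessModes::MPMODE_PARENT_EXECUTES_USERCODE);
```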
Please note that increasing the number of processes to high values doesn't automatically mean that the crawling-process will run faster! Using 3 to 5 processes should be good values to start from.
Override this method to get access to all information about a page or file the crawler found and received.
Every time the crawler found and received a document on its way this method will be called. The crawler passes all information about the currently received page or file to this method by a PHPCrawlerDocumentInfo-object.
Please see the PHPCrawlerDocumentInfo documentation for a list of all properties describing the html-document.
Example:
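A sketch of such an override (the printed properties are a small selection of what PHPCrawlerDocumentInfo provides):

```php
class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
  {
    echo "URL: ".$PageInfo->url."\n";
    echo "HTTP-status: ".$PageInfo->http_status_code."\n";
    echo "Referer: ".$PageInfo->referer_url."\n\n";
  }
}
```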
Overridable method that will be called after the header of a document was received and BEFORE the content will be received.
Every time a header of a document was received, the crawler will call this method. If this method returns any negative integer, the crawler will NOT receive the content of the particular page or file.
Example:
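A sketch, assuming the PHPCrawlerResponseHeader-object exposes the Content-Length as $header->content_length:

```php
class MyCrawler extends PHPCrawler
{
  function handleHeaderInfo(PHPCrawlerResponseHeader $header)
  {
    // Don't receive the content of documents bigger than 1 MB
    if ($header->content_length > 1000000) return -1;
  }
}
```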
Override this method to get access to all information about a page or file the crawler found and received.
Every time the crawler found and received a document on its way this method will be called. The crawler passes all information about the currently received page or file to this method by the array $page_data.
Overridable method that will be called by every used child-process just before it starts the crawling-procedure.
Every child-process of the crawler will call this method just before it starts its crawling-loop from within its process-context.
So when using the multi-process mode "PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE", this method should be overridden and used to open any needed database-connections, file-streams or other similar handles to ensure that they will get opened and accessible for every used child-process.
Example:
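A sketch; the database-credentials are placeholders:

```php
class MyCrawler extends PHPCrawler
{
  protected $db;

  function initChildProcess()
  {
    // Open a separate database-connection for every child-process
    $this->db = new mysqli("localhost", "user", "password", "mydatabase");
  }
}
```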
Initiates a crawler-process
Decides whether the crawler should obey "nofollow"-tags
If set to TRUE, the crawler will not follow links that are marked with rel="nofollow" (like <a href="page.html" rel="nofollow">) nor links from pages containing the meta-tag <meta name="robots" content="nofollow">.
By default, the crawler will NOT obey nofollow-tags.
Decides whether the crawler should parse and obey robots.txt-files.
If this is set to TRUE, the crawler looks for a robots.txt-file for every host that pages or files should be received from during the crawling-process. If a robots.txt-file for a host was found, the containing directives applying to the useragent-identification of the crawler ("PHPCrawl" or manually set by calling setUserAgentString()) will be obeyed.
The default-value is FALSE (for compatibility reasons).
Please note that the directives found in a robots.txt-file have a higher priority than other settings made by the user. If e.g. addFollowMatch("#http://foo\.com/path/file\.html#") was set, but a directive in the robots.txt-file of the host foo.com says "Disallow: /path/", the URL http://foo.com/path/file.html will be ignored by the crawler anyway.
Receives and processes the given URL
Resumes the crawling-process with the given crawler-ID
If a crawling-process was aborted (for whatever reasons), it is possible to resume it by calling the resume()-method before calling the go() or goMultiProcessed() method and passing the crawler-ID of the aborted process to it (as returned by getCrawlerId()).
In order to be able to resume a process, it is necessary that it was initially started with resumption enabled (by calling the enableResumption() method).
This method throws an exception if resuming of a crawling-process failed.
Example of a resumable crawler-script:
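A sketch of such a script; the location used for storing the crawler-ID ("/tmp/mycrawlerid.tmp") is a placeholder:

```php
$crawler = new MyCrawler();
$crawler->enableResumption();

// On the first start: store the crawler-ID (in a tmp-file in this sketch)
if (!file_exists("/tmp/mycrawlerid.tmp"))
{
  file_put_contents("/tmp/mycrawlerid.tmp", $crawler->getCrawlerId());
}
// On a restart after an abort: read the stored ID and resume the process
else
{
  $crawler->resume(file_get_contents("/tmp/mycrawlerid.tmp"));
}

$crawler->setURL("http://www.foo.com/");
$crawler->go();
```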
Alias for enableAggressiveLinkSearch()
Sets the timeout in seconds for connection tries to hosting webservers.
If the connection to a host can't be established within the given time, the request will be aborted.
Sets the content-size-limit for content the crawler should receive from documents.
If the crawler is receiving the content of a page or file and the content-size-limit is reached, the crawler stops receiving content from this page or file.
Please note that the crawler can only find links in the received portion of a document.
The default-value is 0 (no limit).
Alias for enableCookieHandling()
Sets the basic follow-mode of the crawler.
The following list explains the supported follow-modes:
0 - The crawler will follow EVERY link, even if the link leads to a different host or domain. If you choose this mode, you really should set a limit to the crawling-process (see limit-options), otherwise the crawler may crawl the whole WWW!
1 - The crawler only follows links that lead to the same domain as the one in the root-url. E.g. if the root-url (setURL()) is "http://www.foo.com", the crawler will follow links to "http://www.foo.com/..." and "http://bar.foo.com/...", but not to "http://www.another-domain.com/...".
2 - The crawler will only follow links that lead to the same host as the one in the root-url. E.g. if the root-url (setURL()) is "http://www.foo.com", the crawler will ONLY follow links to "http://www.foo.com/...", but not to "http://bar.foo.com/..." and "http://www.another-domain.com/...". This is the default mode.
3 - The crawler only follows links to pages or files located in or under the same path as the one of the root-url. E.g. if the root-url is "http://www.foo.com/bar/index.html", the crawler will follow links to "http://www.foo.com/bar/page.html" and "http://www.foo.com/bar/path/index.html", but not links to "http://www.foo.com/page.html".
Defines whether the crawler should follow redirects sent with headers by a webserver or not.
Defines whether the crawler should follow HTTP-redirects until first content was found, regardless of defined filter-rules and follow-modes.
Sometimes, when requesting an URL, the first thing the webserver does is sending a redirect to another location, and sometimes the server of this new location is sending a redirect again (and so on). So in the end it's possible that you find the expected content on a totally different host than expected.
If you set this option to TRUE, the crawler will follow all these redirects until it finds some content. If content finally was found, the root-url of the crawling-process will be set to this url and all defined options (follow-mode, filter-rules etc.) will relate to it from now on.
Sets the list of html-tags the crawler should search for links in.
By default the crawler searches for links in the following html-tags: href, src, url, location, codebase, background, data, profile, action and open. As soon as the list is set manually, this default list will be overwritten completely.
Example:
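A sketch (this is also the kind of reduced list the note below refers to):

```php
// Only extract links from "href"- and "src"-attributes
$crawler->setLinkExtractionTags(array("href", "src"));
```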
Note: Reducing the number of tags in this list will improve the crawling-performance (a little).
Sets a limit to the number of pages/files the crawler should follow.
If the limit is reached, the crawler stops the crawling-process. The default-value is 0 (no limit).
Sets the port to connect to for crawling the starting-url set in setUrl().
The default port is 80.
Note: Setting the port via setPort(4500) after calling setUrl("http://www.foo.com") has the same effect as setting the root-url directly to "http://www.foo.com:4500".
Assigns a proxy-server the crawler should use for all HTTP-requests.
Sets the timeout in seconds for waiting for data on an established server-connection.
If the connection to a server was established but the server doesn't send data anymore without closing the connection, the crawler will wait for the given time and then close the connection.
Has no function anymore.
Please use setWorkingDirectory() instead.
Sets a limit to the number of bytes the crawler should receive altogether during the crawling-process.
If the limit is reached, the crawler stops the crawling-process. The default-value is 0 (no limit).
Sets the URL of the first page the crawler should crawl (root-page).
The given url may contain the protocol (http://www.foo.com or https://www.foo.com), the port (http://www.foo.com:4500/index.php) and/or basic-authentication-data (http://loginname:passwd@www.foo.com)
This url has to be set before calling the go()-method (of course)! If this root-page doesn't contain any further links, the crawling-process will stop immediately.
Defines what type of cache will be internally used for caching URLs.
Currently phpcrawl is able to use an in-memory-cache or a SQLite-database-cache for caching/storing found URLs internally.
The memory-cache (PHPCrawlerUrlCacheTypes::URLCACHE_MEMORY) is recommended for spidering small to medium websites. It provides better performance, but the php-memory-limit may be hit when too many URLs get added to the cache. This is the default-setting.
The SQLite-cache (PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE) is recommended for spidering huge websites. URLs get cached in a SQLite-database-file, so the cache is only limited by available harddisk-space. To increase performance of the SQLite-cache you may set its location to a shared-memory device like "/dev/shm/" by using the setWorkingDirectory()-method.
Example:
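A sketch, assuming $crawler is an instance of a PHPCrawler-subclass:

```php
// Use the SQLite-cache for spidering a huge website
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
```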
NOTE: When using phpcrawl in multi-process-mode (goMultiProcessed()), the cache-type is automatically set to PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE.
1 -> in-memory-cache (default setting)
2 -> SQLite-database-cache
Or one of the PHPCrawlerUrlCacheTypes::URLCACHE..-constants.
Sets the "User-Agent" identification-string that will be send with HTTP-requests.
Sets the working-directory the crawler should use for storing temporary data.
Every instance of the crawler needs and creates a temporary directory for storing some internal data.
This setting defines which base-directory the crawler will use to store the temporary directories in. By default, the crawler uses the system's temp-directory as working-directory. (e.g. "/tmp/" on linux-systems)
All temporary directories created in the working-directory will be deleted automatically after a crawling-process has finished.
NOTE: To speed up the performance of a crawling-process (especially when using the SQLite-urlcache), try to set a mounted shared-memory device as working-directory (e.g. "/dev/shm/" on Debian/Ubuntu-systems).
Example:
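A sketch:

```php
// Use "/tmp/" as the base-directory for temporary data
$crawler->setWorkingDirectory("/tmp/");
```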
Starts the loop of the controller-process (main-process).
Starts the loop of a child-process.
Documentation generated on Sun, 20 Jan 2013 21:18:49 +0200 by phpDocumentor 1.4.4