Class PHPCrawlerDocumentInfo

Description

Contains information about a page or file the crawler found and received during the crawling-process.

Located in /libs/PHPCrawler/PHPCrawlerDocumentInfo.class.php (line 7)


	
			
Variable Summary
Method Summary
array toArray ()
Variables
array $benchmarks = array() (line 319)

Some internal benchmark results as an array.

  • var: Array containing some internal benchmark results for receiving and processing this document. The keys are the identifiers, the values are the benchmark times.
  • section: 10 Benchmarks
  • access: public
int $bytes_received = 0 (line 141)

The number of bytes the crawler received of the content of the document.

  • var: Received bytes
  • section: 2 Content-related information
  • access: public
string $content = "" (line 159)

The content of the requested document (html-sourcecode or content of file).

Will be empty if "received" is FALSE, and the source will be incomplete if "received_completely" is FALSE.

  • section: 2 Content-related information
  • access: public
string $content_tmp_file = null (line 177)

The temporary file to which the content was received.

Will be NULL if the content wasn't received to the temporary file.

  • section: 2 Content-related information
  • access: public
string $content_type = "" (line 149)

The content-type of the page or file, e.g. "text/html" or "image/gif".

  • var: The content-type
  • section: 2 Content-related information
  • access: public
array $cookies = array() (line 193)

Cookies sent by the server.

  • var: Numeric array containing all sent cookies as PHPCrawlerCookieDescriptor-objects.
  • section: 2 Content-related information
  • access: public
float $data_transfer_rate = null (line 309)

The average data transfer rate for this document.

  • var: The rate in bytes per second.
  • section: 10 Benchmarks
  • access: public
float $data_transfer_time = null (line 301)

The time it took to receive the document.

  • var: The time in seconds
  • section: 10 Benchmarks
  • access: public
int $error_code = null (line 278)

The code of the error that may have occurred while requesting/receiving the document.

(See PHPCrawlerRequestErrors::ERROR_... - constants)

bool $error_occured = false (line 269)

Indicates whether an error occurred while requesting/receiving the document.

  • var: TRUE if an error occurred.
  • section: 8 Error-handling
  • access: public
string $error_string = null (line 286)

A human-readable string describing the error that may have occurred while requesting/receiving the document.

  • var: A human readable error-string.
  • section: 8 Error-handling
  • access: public
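
The three error-related properties above are typically read together. The following is a minimal sketch, assuming a plain stdClass stand-in that carries the documented property names; the helper describeRequestError() and the sample values are hypothetical and not part of PHPCrawl.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): summarises the error fields
// of a document-info object, or returns null when no error was flagged.
function describeRequestError(object $info): ?string
{
    if (!$info->error_occured) {
        return null;
    }
    // error_code and error_string default to null, so fall back to placeholders.
    $code = $info->error_code ?? 'unknown';
    $text = $info->error_string ?? 'no description available';
    return "Error {$code}: {$text}";
}

// Stand-in object using the documented property names; values are illustrative.
$info = new stdClass();
$info->error_occured = true;
$info->error_code = 3;
$info->error_string = 'Connection timed out';

echo describeRequestError($info), PHP_EOL;
```

Note that the property name keeps the spelling $error_occured used by the class.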
string $file = "" (line 47)

The name of the requested page or file, e.g. "page.html".

  • section: 1 URL-related information
  • access: public
string $header = "" (line 71)

The complete HTTP-header the webserver sent in response to the request for this page or file.

  • section: 2 Content-related information
  • access: public
string $header_send = "" (line 86)

The complete HTTP-request-header the crawler sent to the server (debugging info).

  • access: public
string $host = "" (line 31)

The host-part of the URL of the requested page or file, e.g. "www.foo.com".

  • section: 1 URL-related information
  • access: public
int $http_status_code = null (line 185)

The HTTP status code the webserver returned for the request, e.g. 200 (OK) or 404 (file not found).

  • section: 2 Content-related information
  • access: public
array $links_found = array() (line 211)

A numeric array containing information about all links that were found in the source of the page.

Every element of that numeric array is itself an array with the following keys:

  • link_raw - the raw link as it was found
  • url_rebuild - the fully qualified URL the link leads to
  • linkcode - the HTML code fragment that contained the link
  • linktext - the link text the link was laid over (may be empty)

So e.g. $page_data["links_found"][4]["link_raw"] contains the fifth link that was found in the current page, since the array is zero-indexed. (May be something like "../../foo.html").

  • section: 3 Information about found links
  • access: public
array $links_found_url_descriptors = array() (line 224)

A numeric array containing a PHPCrawlerURLDescriptor-object for every link that was found in the page.

Example: Printing the second raw link that was found on the page (the array is zero-indexed)

  echo $PageInfo->links_found_url_descriptors[1]->link_raw;

  • var: Numeric array containing PHPCrawlerURLDescriptor-objects
  • section: 3 Information about found links
  • access: public
array $meta_attributes = array() (line 330)

All meta-tag attributes found in the source of the document.

  • var: Associative array containing all found meta-attributes. The keys are the meta-names, the values the content of the attributes. (like $tags["robots"] = "nofollow")
  • section: 2 Content-related information
  • access: public
string $path = "" (line 39)

The path in the URL of the requested page or file, e.g. "/page/".

  • section: 1 URL-related information
  • access: public
int $port (line 63)

The port of the URL the request was sent to, e.g. 80.

  • section: 1 URL-related information
  • access: public
string $protocol = "" (line 23)

The protocol-part of the URL of the page or file, e.g. "http://"

  • section: 1 URL-related information
  • access: public
string $query = "" (line 55)

The query-part of the URL of the requested page or file, e.g. "?x=y".

  • section: 1 URL-related information
  • access: public
bool $received = false (line 94)

Flag indicating whether content was received from the page or file.

  • var: TRUE if the crawler received at least some source/content of this page or file.
  • section: 2 Content-related information
  • access: public
bool $received_completely = false (line 105)

Flag indicating whether content was completely received from the page or file.

The content of the current document may not be received completely due to settings made with PHPCrawler::setContentSizeLimit() or PHPCrawler::setTrafficLimit().

  • var: TRUE if the crawler received the complete source/content of this page or file.
  • section: 2 Content-related information
  • access: public
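
The two flags $received and $received_completely together distinguish three cases. As a minimal sketch, assuming a stdClass stand-in with the documented property names (the helper receptionStatus() and its labels are hypothetical):

```php
<?php
// Hypothetical helper: maps the two reception flags to a short status label.
function receptionStatus(object $info): string
{
    if (!$info->received) {
        return 'nothing received';
    }
    // Some content arrived; check whether a size or traffic limit cut it short.
    return $info->received_completely ? 'complete' : 'partial';
}

// Stand-in object; the values are illustrative.
$info = new stdClass();
$info->received = true;
$info->received_completely = false;

echo receptionStatus($info), PHP_EOL;
```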
mixed $received_completly = false (line 113)

Alias for received_completely, which was misspelled in previous versions of phpcrawl.

  • deprecated:
  • section: 11 Deprecated
  • access: public
bool $received_to_file = false (line 133)

Will be true if the content was received into a temporary file.

The content is stored in the temporary file $pageInfo->content_tmp_file in this case.

  • section: 2 Content-related information
  • access: public
bool $received_to_memory = false (line 123)

Will be true if the content was received into local memory.

You will have access to the content of the current page or file through $pageInfo->source.

  • section: 2 Content-related information
  • access: public
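
Because the content may live either in memory or in a temporary file, code that needs the content can branch on the two flags. The following sketch assumes a stdClass stand-in with the documented property names; the helper readDocumentContent() and the temporary file created for the demonstration are hypothetical.

```php
<?php
// Hypothetical helper: returns the document content from memory or from the
// temporary file, or null if neither source is available.
function readDocumentContent(object $info): ?string
{
    if ($info->received_to_memory) {
        return $info->source;
    }
    if ($info->received_to_file && $info->content_tmp_file !== null) {
        $data = file_get_contents($info->content_tmp_file);
        return $data === false ? null : $data;
    }
    return null;
}

// Demonstration with a throwaway temporary file standing in for content_tmp_file.
$tmpPath = tempnam(sys_get_temp_dir(), 'doc');
file_put_contents($tmpPath, '<html>from file</html>');

$fileInfo = new stdClass();
$fileInfo->received_to_memory = false;
$fileInfo->received_to_file = true;
$fileInfo->content_tmp_file = $tmpPath;
$fileInfo->source = '';

$fromFile = readDocumentContent($fileInfo);
unlink($tmpPath); // clean up the demo file

echo $fromFile, PHP_EOL;
```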
string $referer_url = null (line 232)

The complete URL of the page that contained the link to this document.

  • section: 7 Referer information
  • access: public
string $refering_linkcode = null (line 242)

The html-sourcecode that contained the link to the current document.

(E.g. <a href="../foo.html">LINKTEXT</a>)

  • section: 7 Referer information
  • access: public
string $refering_linktext = null (line 261)

The linktext of the link that pointed to this document.

E.g. if the referring link was <a href="../foo.html">LINKTEXT</a>, the referring linktext is "LINKTEXT". May contain html-tags.

  • section: 7 Referer information
  • access: public
string $refering_link_raw = null (line 250)

Contains the raw link as it was found in the content of the referring URL. (E.g. "../foo.html")

  • section: 7 Referer information
  • access: public
PHPCrawlerResponseHeader $responseHeader (line 79)

The complete HTTP-header the webserver sent in response to the request for this page or file, as a PHPCrawlerResponseHeader-object.

  • section: 2 Content-related information
  • access: public
string $source = "" (line 167)

Same as "content", the content of the requested document.

  • section: 2 Content-related information
  • access: public
bool $traffic_limit_reached = false (line 293)

Indicates whether the traffic-limit set by the user was reached after downloading this document.

  • var: TRUE if traffic-limit was reached.
  • access: public
string $url = "" (line 15)

The complete, fully qualified URL of the page or file, e.g. "http://www.foo.com/bar/page.html?x=y".

  • section: 1 URL-related information
  • access: public
Methods
setLinksFoundArray (line 337)

Workaround-method, copies and converts the array $links_found_url_descriptors to $links_found.

  • access: public
void setLinksFoundArray ()
toArray (line 357)

Returns an array with all properties of this class.

  • access: public
array toArray ()

Documentation generated on Sun, 20 Jan 2013 21:18:50 +0200 by phpDocumentor 1.4.4