-
$abort_reason
-
The reason for aborting the crawling-process.
-
$abort_reason
-
The reason why the crawling-process was aborted.
-
$aggressive_search
-
Specifies whether the crawler should also search for links outside of HTML-tags.
-
$auth_password
-
-
$auth_username
-
-
addBasicAuthentication
-
Adds a basic-authentication (username and password) to the list of authentications that will be sent with requests.
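A minimal sketch of how such an authentication could be registered. The URL-regex and the credentials are made-up example values, and $crawler is assumed to be an instance of a PHPCrawler subclass:

```php
<?php
// Assumption: $crawler is an instance of a PHPCrawler subclass.
// The credentials are only sent with requests whose URL matches the regex.
$crawler->addBasicAuthentication("#http://www\.example\.com/protected/.*#", "myuser", "topsecret");
```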
-
addBasicAuthentication
-
Adds a basic-authentication (username and password) to the list of basic authentications that will be sent with requests.
-
addContentTypeReceiveRule
-
Adds a rule to the list of rules that decide which pages or files (regarding their content-type) should be received.
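For illustration, a rule like the following is typically used to restrict receiving to HTML documents. This is only a sketch; $crawler is assumed to be an instance of a PHPCrawler subclass:

```php
<?php
// Assumption: $crawler is an instance of a PHPCrawler subclass.
// Only receive the content of documents whose content-type matches the regex.
$crawler->addContentTypeReceiveRule("#text/html#");
```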
-
addCookie
-
Adds a cookie to the cookie-cache.
-
addCookie
-
Adds a cookie to the cookie-cache.
-
addCookie
-
Adds a cookie to send with the request.
-
addCookie
-
Adds a cookie to the cookie-cache.
-
addCookieDescriptor
-
Adds a cookie to send with the request.
-
addCookieDescriptors
-
Adds a bunch of cookies to send with the request.
-
addCookies
-
Adds a bunch of cookies to the cookie-cache.
-
addCookies
-
Adds a bunch of cookies to the cookie-cache.
-
addCookies
-
Adds a bunch of cookies to the cookie-cache.
-
addDocumentInfo
-
Adds a PHPCrawlerDocumentInfo-object to the queue
-
addFollowMatch
-
Alias for addURLFollowRule().
-
addLinkExtractionTags
-
Adds HTML-tags to the list of tags from which links should be extracted.
-
addLinkPriorities
-
Adds a bunch of link-priorities
-
addLinkPriority
-
Adds a Link-Priority-Level
-
addLinkPriority
-
Adds a regular expression together with a priority-level to the list of rules that decide which links should be preferred.
-
addLinkSearchContentType
-
Adds a rule to the list of rules that decide in which kinds of documents (regarding their content-type) the crawler should search for links.
-
addLinkSearchContentType
-
Adds a rule to the list of rules that decide which kinds of documents (regarding their content-type) should get checked for links.
-
addLinkToCache
-
-
addNonFollowMatch
-
Alias for addURLFilterRule().
-
addPostData
-
Adds post-data to send with the request.
-
addPostData
-
Adds post-data together with a URL-rule to the list of post-data to send with requests.
-
addPostData
-
Adds post-data together with a URL-regex to the list of post-data to send with requests.
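A hedged sketch of how post-data might be attached to matching requests; the URL-regex and the field names are made-up example values:

```php
<?php
// Assumption: $crawler is an instance of a PHPCrawler subclass.
// The post-data is only sent with requests whose URL matches the regex.
$crawler->addPostData("#http://www\.example\.com/login\.php#",
                      array("username" => "me", "password" => "secret"));
```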
-
addReceiveContentType
-
Adds a rule to the list of rules that decide which pages or files (regarding their content-type) should be received.
-
addReceiveContentType
-
Alias for addContentTypeReceiveRule().
-
addReceiveToMemoryMatch
-
Has no function anymore!
-
addReceiveToTmpFileMatch
-
Alias for addStreamToFileContentType().
-
addStreamToFileContentType
-
Adds a rule to the list of rules that decide what types of content should be streamed directly to a temporary file.
-
addStreamToFileContentType
-
Adds a rule to the list of rules that decide what types of content should be streamed directly to a temporary file.
-
addURL
-
Adds a URL to the URL-cache.
-
addURL
-
Adds a URL to the URL-cache.
-
addURL
-
Adds a URL to the URL-cache.
-
addURLFilterRule
-
Adds a rule to the list of rules that decide which URLs found on a page should be ignored by the crawler.
-
addURLFilterRule
-
Adds a rule to the list of rules that decide which URLs found on a page should be ignored by the crawler.
-
addURLFilterRules
-
Adds a bunch of rules to the list of rules that decide which URLs found on a page should be ignored by the crawler.
-
addURLFollowRule
-
-
addURLFollowRule
-
Adds a rule to the list of rules that decide which URLs found on a page should be followed explicitly.
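Follow-rules and filter-rules are both plain regular expressions, as in this sketch (the URLs and patterns are made-up examples; $crawler is assumed to be an instance of a PHPCrawler subclass):

```php
<?php
// Assumption: $crawler is an instance of a PHPCrawler subclass.
// Only follow URLs pointing into the (hypothetical) /docs/ section ...
$crawler->addURLFollowRule("#http://www\.example\.com/docs/.*#");
// ... and ignore links to image files.
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
```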
-
addURLs
-
Adds a bunch of URLs to the URL-cache.
-
addURLs
-
Adds a bunch of URLs to the URL-cache.
-
addURLs
-
Adds a bunch of URLs to the URL-cache.
-
addURL_Entry
-
-
$child_process_number
-
Number of the child-process (NOT the PID!).
-
$class_version
-
-
$content
-
The content of the requested document (html-sourcecode or content of file).
-
$content_length
-
The content-length as stated in the header.
-
$content_size_limit
-
Limit for content-size to receive
-
$content_tmp_file
-
The temporary file to which the content was received.
-
$content_type
-
The content-type of the page or file, e.g. "text/html" or "image/gif".
-
$content_type
-
The content-type
-
$CookieCache
-
The PHPCrawlerCookieCache-Object
-
$cookies
-
All cookies found in the header
-
$cookies
-
Cookies sent by the server.
-
$cookies
-
-
$cookie_array
-
Array containing cookies to send with the request
-
$cookie_handling_enabled
-
Flag indicating whether cookie-handling is enabled or disabled.
-
$cookie_send_time
-
The time the cookie was sent.
-
$crawlerStatus
-
-
$crawler_uniqid
-
-
$crawler_uniqid
-
UID of this instance of the crawler
-
$CurrentDocumentInfo
-
Current PHPCrawlerDocumentInfo-object of the current document
-
checkForAbort
-
Checks if the crawling-process should be aborted.
-
checkRegexPattern
-
Checks whether a given RegEx-pattern is valid or not.
-
checkStringAgainstRegexArray
-
Checks whether a given string matches with one of the given regular-expressions.
-
childProcessAlive
-
Checks whether any child-processes are (still) running.
-
cleanup
-
Cleans up the cache once it is not needed anymore.
-
cleanup
-
Performs cleanups after the cache is not needed anymore.
-
cleanup
-
Cleans up the crawler after it has finished.
-
cleanup
-
Has no function in this class.
-
clear
-
Removes all URLs and all priority-rules from the URL-cache.
-
clear
-
Removes all URLs and all priority-rules from the URL-cache.
-
clear
-
Removes all URLs and all priority-rules from the URL-cache.
-
clearCookies
-
Removes all cookies to send with the request.
-
clearPostData
-
Removes all post-data to send with the request.
-
containsURLs
-
Checks whether there are URLs left in the cache or not.
-
containsURLs
-
Checks whether there are URLs left in the cache that should be processed or not.
-
containsURLs
-
Checks whether there are URLs left in the cache or not.
-
createPreparedInsertStatement
-
Creates the prepared statement for inserting URLs into the database (if not done yet).
-
createPreparedStatements
-
-
createWorkingDirectory
-
Creates the working-directory for this instance of the crawler.
-
$general_follow_mode
-
The general follow-mode of the crawler
-
$global_traffic_count
-
Global counter for traffic this instance of the HTTPRequest-class caused.
-
getAllBenchmarks
-
Returns all registered benchmark-results.
-
getAllMetaAttributes
-
Returns all meta-tag attributes found so far in the document.
-
getAllURLs
-
Returns all URLs currently cached in the URL-cache.
-
getAllURLs
-
Returns all URLs/links found so far in the document.
-
getAllURLs
-
Returns all URLs currently cached in the URL-cache.
-
getAllURLs
-
Has no function in this class
-
getApplyingLines
-
Returns all raw lines in the given robots.txt-content that apply to the given useragent-string.
-
getBaseUrlFromMetaTag
-
Returns the base-URL specified in a meta-tag in the given HTML-source
-
getBasicAuthenticationForUrl
-
Returns the basic-authentication (username and password) that should be sent to the given URL.
-
getCallCount
-
-
getChildPIDs
-
Returns the PIDs of all running child-processes.
-
getCookiesForUrl
-
Returns all cookies from the cache that are addressed to the given URL.
-
getCookiesForUrl
-
Returns all cookies from the cache that are addressed to the given URL.
-
getCookiesForUrl
-
Returns all cookies from the cache that are addressed to the given URL.
-
getCookiesFromHeader
-
Returns all cookies from the given response-header.
-
getCrawlerId
-
Returns the unique ID of the instance of the crawler
-
getCrawlerStatus
-
Returns/reads the current crawler-status
-
getDistinctURLHash
-
Returns the distinct-hash for the given URL, ensuring that no URL is cached more than once.
-
getDocumentInfoCount
-
Returns the current number of PHPCrawlerDocumentInfo-objects in the queue
-
getElapsedTime
-
Gets the elapsed time for the given benchmark.
-
getFromHeaderLine
-
Returns a PHPCrawlerCookieDescriptor-object initiated by the given cookie-header-line.
-
getGlobalTrafficCount
-
Returns the global traffic this instance of the HTTPRequest-class caused so far.
-
getHeaderValue
-
Gets the value of a header-directive from the given HTTP-header.
-
getHTTPStatusCode
-
Gets the HTTP-statuscode from a given response-header.
-
getIP
-
Returns the IP for the given hostname.
-
getLastModified
-
Gets the Last-Modified-header.
-
getMaxPriorityLevel
-
Returns the highest priority-level for which a URL exists in the cache.
-
getMetaTagAttributes
-
Gets all meta-tag attributes from the given HTML-source.
-
getmicrotime
-
Returns the current time in seconds and milliseconds.
-
getNextDocumentInfo
-
Returns a PHPCrawlerDocumentInfo-object from the queue
-
getNextUrl
-
Returns the next URL from the cache that should be crawled.
-
getNextUrl
-
Returns the next URL from the cache that should be crawled.
-
getNextUrl
-
Returns the next URL from the cache that should be crawled.
-
getPostDataForUrl
-
Returns the post-data (key and value) that should be send to the given URL.
-
getProcessReport
-
Returns summarizing report-information about the crawling-process after it has finished.
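A sketch of reading the report after a finished run. The property names follow the PHPCrawlerProcessReport entries; $crawler is assumed to have finished crawling (e.g. after a call to go()):

```php
<?php
// Assumption: $crawler is a PHPCrawler subclass instance that has
// already finished crawling (e.g. after a call to go()).
$report = $crawler->getProcessReport();
echo "Links followed: " . $report->links_followed . "\n";
echo "Documents received: " . $report->files_received . "\n";
echo "Bytes received: " . $report->bytes_received . "\n";
```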
-
getRedirectURLFromHeader
-
Returns the redirect-URL from the given HTML-header
-
getReport
-
Returns an array with summarizing report-information after the crawling-process has finished.
-
getRobotsTxtContent
-
Retrieves the content of a robots.txt-file.
-
getRobotsTxtURL
-
Returns the robots.txt-URL related to the given URL.
-
getRootUrl
-
Returns the normalized root-URL of the given URL
-
getSystemTempDir
-
Determines the system's temporary directory.
-
getUrlCount
-
-
getUrlPriority
-
Gets the priority-level of the given URL
-
go
-
Starts the crawling process in single-process-mode.
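A minimal single-process run might look like this sketch, where MyCrawler is assumed to be a user-defined subclass of PHPCrawler and the start-URL is a made-up example:

```php
<?php
// Assumption: MyCrawler is a user-defined subclass of PHPCrawler
// that overrides handleDocumentInfo().
$crawler = new MyCrawler();
$crawler->setURL("www.example.com"); // example start-URL
$crawler->go();                      // single-process mode
```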
-
goMultiProcessed
-
Starts the crawler using multiple processes.
-
$header
-
The complete HTTP-header the webserver responded with for this page or file.
-
$header_check_callback_function
-
-
$header_raw
-
The raw HTTP-header as it was sent by the server.
-
$header_send
-
The complete HTTP-request-header the crawler sent to the server (debugging info).
-
$host
-
The host-part of the URL of the requested page or file, e.g. "www.foo.com".
-
$host
-
-
$host_ip_array
-
Array for caching IPs of the requested hostnames
-
$http_status_code
-
The HTTP-statuscode
-
$http_status_code
-
The HTTP-statuscode the webserver responded for the request, e.g. 200 (OK) or 404 (file not found).
-
handleDocumentInfo
-
Override this method to get access to all information about a page or file the crawler found and received.
-
handleDocumentInfo
-
Override this method to get access to all information about a page or file the crawler found and received.
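A sketch of a typical override; PHPCrawlerDocumentInfo exposes properties such as $url and $http_status_code (see the corresponding entries in this index):

```php
<?php
// A sketch of overriding handleDocumentInfo() in a PHPCrawler subclass.
class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
  {
    // Print the URL and HTTP status-code of each received document.
    echo $DocInfo->url . " (" . $DocInfo->http_status_code . ")\n";
    // Returning a negative value aborts the crawling-process.
  }
}
```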
-
handleHeaderInfo
-
Overridable method that will be called after the header of a document was received and BEFORE the content will be received.
-
handlePageData
-
Override this method to get access to all information about a page or file the crawler found and received.
-
hostInCache
-
Checks whether a hostname is already cached.
-
$PageRequest
-
A PHPCrawlerHTTPRequest-object for requesting robots.txt-files.
-
$PageRequest
-
The PHPCrawlerHTTPRequest-Object
-
$path
-
The path in the URL of the requested page or file, e.g. "/page/".
-
$path
-
Cookie-path
-
$path
-
-
$PDO
-
-
$PDO
-
-
$PDO
-
PDO-object for querying SQLite-file.
-
$porcess_abort_reason
-
The reason why the process was aborted/finished.
-
$port
-
The port of the URL the request was sent to, e.g. 80.
-
$port
-
-
$post_data
-
Array containing POST-data to send with the request
-
$post_data
-
Array containing post-data to send.
-
$PreparedInsertStatement
-
Prepared statement for inserting URLs into the db-file, as a PDOStatement-object.
-
$prepared_statements_created
-
-
$ProcessCommunication
-
ProcessCommunication-object
-
$process_runtime
-
The total time the crawling-process was running in seconds.
-
$protocol
-
The protocol-part of the URL of the page or file, e.g. "http://"
-
$protocol
-
-
$proxy
-
The proxy to use
-
PHPCrawlerCookieCacheBase.class.php
-
-
PHPCrawlerMemoryCookieCache.class.php
-
-
PHPCrawlerSQLiteCookieCache.class.php
-
-
PHPCrawler.class.php
-
-
PHPCrawlerBenchmark.class.php
-
-
PHPCrawlerCookieDescriptor.class.php
-
-
PHPCrawlerDNSCache.class.php
-
-
PHPCrawlerDocumentInfo.class.php
-
-
PHPCrawlerHTTPRequest.class.php
-
-
PHPCrawlerLinkFinder.class.php
-
-
PHPCrawlerProcessReport.class.php
-
-
PHPCrawlerResponseHeader.class.php
-
-
PHPCrawlerRobotsTxtParser.class.php
-
-
PHPCrawlerStatus.class.php
-
-
PHPCrawlerURLDescriptor.class.php
-
-
PHPCrawlerURLFilter.class.php
-
-
PHPCrawlerUrlPartsDescriptor.class.php
-
-
PHPCrawlerUserSendDataCache.class.php
-
-
PHPCrawlerUtils.class.php
-
-
PHPCrawlerDocumentInfoQueue.class.php
-
-
PHPCrawlerProcessCommunication.class.php
-
-
PHPCrawlerMemoryURLCache.class.php
-
-
PHPCrawlerSQLiteURLCache.class.php
-
-
PHPCrawlerURLCacheBase.class.php
-
-
parseRobotsTxt
-
Parses the robots.txt-file related to the given URL and returns regular-expression rules corresponding to the contained "disallow"-rules that are addressed to the given user-agent.
-
PHPCrawler
-
The main class of PHPCrawl.
-
PHPCrawlerBenchmark
-
A static benchmark-class for doing benchmarks within phpcrawl.
-
PHPCrawlerCookieCacheBase
-
Abstract baseclass for storing cookies.
-
PHPCrawlerCookieDescriptor
-
Describes a cookie within the PHPCrawl-system.
-
PHPCrawlerDNSCache
-
Simple DNS-cache used by phpcrawl.
-
PHPCrawlerDocumentInfo
-
Contains information about a page or file the crawler found and received during the crawling-process.
-
PHPCrawlerDocumentInfoQueue
-
Queue for PHPCrawlerDocumentInfo-objects
-
PHPCrawlerHTTPRequest
-
Class for performing HTTP-requests.
-
PHPCrawlerLinkFinder
-
Class for finding links in HTML-documents.
-
PHPCrawlerMemoryCookieCache
-
Class for storing/caching cookies in memory.
-
PHPCrawlerMemoryURLCache
-
Class for caching/storing URLs/links in memory.
-
PHPCrawlerProcessCommunication
-
Class containing methods for process handling and communication
-
PHPCrawlerProcessReport
-
Contains summarizing information about a crawling-process after the process is finished.
-
PHPCrawlerResponseHeader
-
Describes an HTTP response-header within the phpcrawl-system.
-
PHPCrawlerRobotsTxtParser
-
Class for parsing robots.txt-files.
-
PHPCrawlerSQLiteCookieCache
-
Class for storing/caching cookies in a SQLite-db-file.
-
PHPCrawlerSQLiteURLCache
-
Class for caching/storing URLs/links in a SQLite-database-file.
-
PHPCrawlerStatus
-
Describes the current status of a crawler-instance.
-
PHPCrawlerURLCacheBase
-
Abstract baseclass for implemented URL-caching classes.
-
PHPCrawlerURLDescriptor
-
Describes a URL within the PHPCrawl-system.
-
PHPCrawlerURLFilter
-
Class for filtering URLs by given filter-rules.
-
PHPCrawlerUrlPartsDescriptor
-
Describes the single parts of a URL.
-
PHPCrawlerUserSendDataCache
-
Cache for storing user-data to send with requests, like cookies, post-data and basic-authentications.
-
PHPCrawlerUtils
-
Static util-methods used by phpcrawl.
-
prepareHTTPRequestQuery
-
Prepares the given HTTP-query-string for the HTTP-request.
-
printAllBenchmarks
-
-
processHTTPHeader
-
Processes the response-header of the document.
-
processRobotsTxt
-
-
processUrl
-
Receives and processes the given URL
-
purgeCache
-
Has no function in this class.
-
purgeCache
-
Cleans/purges the URL-cache from inconsistent entries.
-
purgeCache
-
Cleans/purges the URL-cache from inconsistent entries.
-
$received
-
Flag indicating whether content was received from the page or file.
-
$received_completely
-
Flag indicating whether content was completely received from the page or file.
-
$received_completly
-
Alias for received_completely; it was misspelled in previous versions of phpcrawl.
-
$received_to_file
-
Will be true if the content was received into temporary file.
-
$received_to_memory
-
Will be true if the content was received into local memory.
-
$receive_content_types
-
Contains all rules defining the content-types that should be received
-
$receive_to_file_content_types
-
Contains all rules defining the content-types of pages/files that should be streamed directly to a temporary file (instead of to memory)
-
$referer_url
-
The complete URL of the page that contained the link to this document.
-
$refering_linkcode
-
The html-sourcecode that contained the link to the current document.
-
$refering_linktext
-
The linktext of the link that "linked" to this document.
-
$refering_link_raw
-
Contains the raw link as it was found in the content of the referring URL (e.g. "../foo.html").
-
$refering_url
-
The URL of the page that contained the link to the URL described here.
-
$responseHeader
-
The complete HTTP-header the webserver responded with for this page or file, as a PHPCrawlerResponseHeader-object.
-
$resumtion_enabled
-
Flag indicating whether resumption is enabled.
-
$resumtion_enabled
-
Flag indicating whether resumption is enabled.
-
$RobotsTxtParser
-
The RobotsTxtParser-Object
-
readResponseContent
-
Reads the response-content.
-
readResponseHeader
-
Reads the response-header.
-
registerChildPID
-
Registers the PID of a child-process
-
reset
-
Resets the clock for the given benchmark.
-
resetAll
-
Resets all clocks for all benchmarks.
-
resetLinkCache
-
Resets/clears the internal link-cache.
-
resume
-
Resumes the crawling-process with the given crawler-ID
-
rmDir
-
Deletes a directory recursively.
-
$socket
-
The socket used for HTTP-requests
-
$socketConnectTimeout
-
Timeout-value for socket-connection
-
$socketReadTimeout
-
Socket-read-timeout
-
$source
-
Same as "content", the content of the requested document.
-
$SourceUrl
-
The URL of the html-source to find links from
-
$source_domain
-
The domain the cookie was sent from.
-
$source_url
-
The URL the cookie was sent from.
-
$source_url
-
The URL of the website the header was received from.
-
$sqlite_db_file
-
-
$sqlite_db_file
-
-
$sqlite_db_file
-
-
$starting_url
-
The fully qualified and normalized URL the crawling-process was started with.
-
$starting_url
-
The URL the crawler should start with.
-
$starting_url_parts
-
The URL-parts of the starting-url.
-
sendRequest
-
Sends the HTTP-request and receives the page/file.
-
sendRequestHeader
-
Send the request-header.
-
serializeToFile
-
Serializes data (objects, arrays etc.) and writes it to the given file.
-
setAggressiveLinkExtraction
-
Alias for enableAggressiveLinkSearch()
-
setBaseURL
-
Sets the base-URL of the crawling-process to which some rules relate.
-
setBasicAuthentication
-
Sets basic-authentication login-data for protected URLs.
-
setConnectionTimeout
-
Sets the timeout in seconds for connection tries to hosting webservers.
-
setContentSizeLimit
-
Sets the content-size-limit for content the crawler should receive from documents.
-
setContentSizeLimit
-
Sets the size-limit in bytes for content the request should receive.
-
setCookieHandling
-
Alias for enableCookieHandling()
-
setCrawlerStatus
-
Sets/writes the current crawler-status
-
setFindRedirectURLs
-
Specifies whether redirect-URLs set in HTTP-headers should be searched for.
-
setFollowMode
-
Sets the basic follow-mode of the crawler.
-
setFollowRedirects
-
Defines whether the crawler should follow redirects sent with headers by a webserver or not.
-
setFollowRedirectsTillContent
-
Defines whether the crawler should follow HTTP-redirects until first content was found, regardless of defined filter-rules and follow-modes.
-
setHeaderCheckCallbackFunction
-
-
setLinkExtractionTags
-
Sets the html-tags from which links should be extracted.
-
setLinkExtractionTags
-
Sets the list of html-tags the crawler should search for links in.
-
setLinksFoundArray
-
Workaround-method, copies and converts the array $links_found_url_descriptors to $links_found.
-
setPageLimit
-
Sets a limit to the number of pages/files the crawler should follow.
-
setPort
-
Sets the port to connect to for crawling the starting-url set in setUrl().
-
setProxy
-
Assigns a proxy-server the crawler should use for all HTTP-Requests.
-
setProxy
-
-
setSourceUrl
-
Sets the source-URL of the document to find links in
-
setStreamTimeout
-
Sets the timeout in seconds for waiting for data on an established server-connection.
-
setTmpFile
-
Has no function anymore.
-
setTmpFile
-
Sets the temporary file to use when content of found documents should be streamed directly into a temporary file.
-
setTrafficLimit
-
Sets a limit to the number of bytes the crawler should receive altogether during the crawling-process.
-
setUrl
-
Sets the URL for the request.
-
setURL
-
Sets the URL of the first page the crawler should crawl (root-page).
-
setUrlCacheType
-
Defines what type of cache will be internally used for caching URLs.
-
setUserAgentString
-
Sets the "User-Agent" identification-string that will be sent with HTTP-requests.
-
setWorkingDirectory
-
Sets the working-directory the crawler should use for storing temporary data.
-
SMCCrawler
-
Loads the external PHPCrawler-class.
-
sort2dArray
-
Sorts a two-dimensional array.
-
splitURL
-
Splits a URL into its parts.
-
starControllerProcessLoop
-
Starts the loop of the controller-process (main-process).
-
start
-
Starts the clock for the given benchmark.
-
startChildProcessLoop
-
Starts the loop of a child-process.
-
stop
-
Stops the benchmark-clock for the given benchmark.
-
$url
-
The complete, full qualified URL of the page or file, e.g. "http://www.foo.com/bar/page.html?x=y".
-
$urlcache_purged
-
Flag indicating whether the URL-cache was purged at the beginning of a crawling-process.
-
$UrlDescriptor
-
The URL for the request as PHPCrawlerURLDescriptor-object
-
$UrlFilter
-
The UrlFilter-Object
-
$urls
-
-
$url_cache_type
-
The URL-cache-type.
-
$url_distinct_property
-
Defines which property of an URL is used to ensure that each URL is only cached once.
-
$url_filter_rules
-
Array containing regex-rules for URLs that should NOT be followed.
-
$url_follow_rules
-
Array containing regex-rules for URLs that should be followed.
-
$url_map
-
-
$url_parts
-
The parts of the URL for the request as returned by PHPCrawlerUtils::splitURL()
-
$url_priorities
-
-
$url_rebuild
-
The complete, full qualified and normalized URL
-
$userAgentString
-
The user-agent-string
-
$UserSendDataCache
-
The UserSendDataCache-object.
-
$user_abort
-
Will be TRUE if the crawling-process stopped because the overridable function handleDocumentInfo() returned a negative value.
-
updateCrawlerStatus
-
Updates the status of the crawler
-
URLHASH_NONE
-
-
URLHASH_RAWLINK
-
-
URLHASH_URL
-
-
urlHostInCache
-
Checks whether the hostname of the given URL is already cached
-
urlMatchesRules
-
Checks whether a given URL matches the rules.