Sitemap Creator 0.2 beta


Sitemap Creator crawls your website, creating sitemaps compatible with the standard sitemaps.org protocol supported by Google, Yahoo!, MSN and MoreOver. The script pings the Google, Yahoo!, MSN and MoreOver bots to download the sitemap, then tracks the bots and sends you an email on every scan of your sitemap with a full report of the response.
Sitemaps are created from a CSV file which can easily be edited with any text editor before creating the sitemap. Sitemap Creator has three built-in ranking mechanisms which decide the priorities of your pages depending on the number and placement of link backs, crawled-first links, or URL structure. You can also limit the crawler by run time or number of URLs.
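As a rough sketch, most of this behaviour is driven by constants in the config file. SMC_SITE, SMC_DISABLED_DIRS, SMC_USE_BLACKLIST and SMC_GSS are real constants mentioned in the comments below; the values here are placeholders only:

```php
<?php
// Illustrative Sitemap Creator configuration (values are placeholders).
define('SMC_SITE', 'example.com');        // the domain to crawl
define('SMC_DISABLED_DIRS', '/private/'); // regex of directories/links to skip
define('SMC_USE_BLACKLIST', true);        // reuse the blacklist between crawls
define('SMC_GSS', false);                 // set false to disable GSS
```

Check the shipped config file for the exact constant names controlling crawl run time and URL limits.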

    Beta info
  • is not needed anymore; all requests are processed through fsockopen.
  • Bug fix to the link-back ranking functions.
  • Limit the crawler to a number of URLs.
  • Disable crawling of specific directories or links; regular expressions are supported.
  • Limit the number of links crawled on the start page.

The script was tested on PHP5; let me know how it works on PHP4.
An online demo might be available in the next release.
Download (build 20080514) :

  • Pingback: Sitemap Creator 0.2 beta Released « jared.brodsky

  • What is the URL you are supposed to submit to google for the sitemap?

  • @gtnman
    Please use “Add reference to robots.txt” to show you the default sitemap URL, and remember to chmod 666 robots.txt.
    It should look something like
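A generic robots.txt reference looks something like the following; the sitemap URL itself is illustrative, so use the one the script shows you:

```text
User-agent: *
Sitemap: http://example.com/sitemap.xml
```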

  • Interesting note. I have tested this script on two VPS servers, both running the latest version of Plesk on two different web hosts. It seems that the unix command utime (which touch() uses) is not available in Plesk. Is touch() vital to the script or will is_writable() do the same thing? (I made this change, and the robots.txt generated perfectly.)
    Does the cronjob re-generate robots.txt or is this something that the user must do?

  • Another note: I tested this on my Ubuntu box, where utime does exist, and it worked fine.

  • @gtnman
    – touch() is not related to Plesk; you need permissions to modify files in that directory. I cannot find any relation between the touch() function and utime. Make sure you ‘chmod 777’ the data directory.
    – robots.txt is not regenerated every time the sitemap is created; you only need to modify it once.

  • Hi,
    please help me.
    I see this error:

    No Pages were crawled, Please make sure you have set your site domain correctly and you have valid connection to host

  • Where is the right setting for the site?

  • @abyzn
    Can you enable the debug mode from the configuration file and give me the results?

  • I use your Sitemap Creator for a site and all goes OK; it shows a table with the URLs and no error messages. But when I click “Create Sitemaps” the XML is malformed and incorrect. Any suggestion?
    The site is in ISO-8859-1 and the sitemap in UTF-8…

  • @Ferran
    can you give me a link to your sitemap XML file?

  • Pingback: links for 2008-06-18 « Free Open Source Directory()

  • @Ferran
    can you change the constant SMC_GSS to false to disable GSS and try again?

  • @wkarim
    I set SMC_GSS to false and the result is the same…
    When I look at the CSV or the table in sitemap.php, the links show… Could it be the encoding of the files? I tried it on my sites in UTF-8 and it works perfectly, but when I try it on this site, which is in ISO, the XML is corrupt… I don’t understand it; the functions are correct and the config too…

  • Thanks wkarim, works well using this version.

    I use it on my site.

    Any time a search-engine robot reads my sitemap, an email is sent to my mailbox informing me about its activity.

  • flobster

    Nice script! Just one question: I have several URLs ending with “?a=page:2” which I want to exclude from being crawled. Will this be possible using SMC_DISABLED_DIRS?

  • @flobster
    yes, you can use regular expressions to disable those URLs; examples are included in the config file.
    ex: define('SMC_DISABLED_DIRS', '\?a=page:2');

  • Bo

    i get this error:
    WARNING: Page is redirecting to home.php
    DEBUG: URL “” Blacklisted. Reason : Empty Page

    can you please advise?
    thank you in advance

  • @Bo
    it should continue crawling the site normally, but the page will not show in the sitemap as it’s a redirect

  • Installed, crawls fine after logging in via

    Once the crawl has finished I press Create Sitemaps and get “Please crawl first”.

    Server is running PHP 5.2.6

    Any ideas?


  • Same here. I upgraded to the new version and still no luck: when I click the Crawl link it displays a number, then does nothing. If I click any other link it shows that the cache has around 36 elements, but nothing else, no list, nothing. I don’t know what else to do.

  • I’ve been checking the log file and the only error in it is this one:

    [05-Aug-2008 15:06:56] PHP Fatal error: Maximum execution time of 30 seconds exceeded in /home/xxxxxxx/public_html/sitemap/ on line 295

  • @mikey, SearkeCom
    Please enable debug mode to give me a better idea about what is happening.

  • king

    I’m running PHP4 and 0.2b and cannot get it to run. I have got to the point where everything runs fine, but Google and the others return errors. The one below says the file is not present in 20080813, and there is no file

    /hsphere/local/home/mgruppe/ could not be found IP -:
    Date -: 04:26:44 pm ( Wednesday 13 August 2008 ) Bot -: Mozilla/5.0 (compatible; Googlebot/2.1; + Location -:

    if i go to
    it returns that something is wrong with the xml file

    File attributes should be OK.
    Thanks for a nice script!


  • @king
    looks like you pinged Google with a sitemap then deleted it; try pinging Google again with the right sitemap name, or add it manually in Google Webmaster Tools

  • I installed it on my site and it has worked fine so far; I created a sitemap and managed to ping all the search engines.

  • I have tried this script, unmodified, on PHP4 and I get the following message when I run it:
    WARNING: Connection failed (111) Connection refused
    DEBUG: URL “” Blacklisted. Reason : Empty Page

    The page I presume it’s referring to is index.php, and I know it’s not empty. Can you shed any light on this please?
    Many thanks

  • @Stuart
    please check your connection from the server running the script to, check what you get with < ?php echo file_get_contents( ''); ?>

  • Hi,
    I created a new file within the sitemap directory called test.php and copied in the above short script, removing the space after the first < in your example. The results are as follows:
    Warning: file_get_contents( [function.file-get-contents]: failed to open stream: Connection refused in /home/sites/ on line 1

  • @Stuart
    There must be something wrong with your connection

  • jorge

    When I enter my page it asks me for a login.
    Where can I find it?

  • @Jorge
    the password is ‘demopass’
    change it to any password you like in the config file

  • Antonimo

    After crawling the site, the page displays all of the URLs etc. At the end of the page is the URL to add to the crontab.

    The page then shows a link to “Create Sitemaps” – The sitemap has a date stamp. Is the XML sitemap created automatically when it is run from the crontab? Does it have a new time stamp?

    If the Ping is enabled in the config file, will the correct sitemap be “pinged” to the search engines?

  • @Antonimo
    Yes, all sitemaps are time-stamped and the ping is for the newest created one.
    If the request does not contain a time stamp then the script will send the most recently created sitemap.

  • Antonimo

    Thanks for the quick response.
    Cron set up – looking forward to seeing the results.
    Excellent script – kudos

  • Antonimo

    Hi again,

    I am having great difficulty getting the cron to work.

    I have other crons that run fine using the command /usr/bin/php -q /home/DOMAIN/public_html/BackUp.php (for example)

    When I try to run the sitemap creator from cron (/usr/bin/php -q /home/DOMAIN/public_html/sitemap/sitemap.php?do=createsitemap), I receive an error: “No input file specified”

    Is there a special way to do this or can I call the “do” script from another php file?

  • @Antonimo
    The script is not designed to run from the command line; alternatively you can use lynx with crontab… the command should look something like
    lynx -dump
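An illustrative crontab entry (the schedule, path and query string are assumptions; adjust to your install):

```text
# run Sitemap Creator every Sunday at 03:00 via lynx, discarding output
0 3 * * 0 lynx -dump "http://example.com/sitemap/sitemap.php?do=createsitemap" > /dev/null
```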

  • Dave

    This looks great, but any idea why I get this error on one page?

    Warning: Division by zero in /myhost/public_html/sitemap/ on line 320

    Finished crawling, Crawled 1 links

  • @Dave
    I cannot investigate the problem without knowing your domain; you might want to make sure no redirects are made

  • antiwow

    I get a message after crawling:

    Crawler Timed out after 200 seconds while crawling www. etc

    After that, it won’t let me generate a sitemap.

    I looked at the config, but couldn’t find anything.

    Any ideas?

  • @antiwow
    Try enabling the debugging mode from the config file

  • Hello… This is the best free sitemap creator… How is the cron function stopped or restarted? How can I set up a cron job for my site to make another sitemap monthly? Do I insert the link generated at the end of the crawl into a cron job from cPanel?

  • @Luciffere
    Thanks for the comment
    You need to refer to the cPanel documentation for creating cron jobs; for UNIX crontab you may use the Lynx command-line browser to run the script.

  • After the script finished the job, it tells me to put this link in a cron job…
    My question is: if I put this link generated by your script in my cron job and set it monthly (Saturday 1, 12:00 AM), will the script make a sitemap every month on the first day and ping Google, Yahoo, etc.? Do I need to log in to insert the link to the sitemap? I had a manual sitemap at Google but I deleted it. Is it necessary to insert a link at Google manually?
    Thanks for your work…

  • @Luciffere
    you can add “” to Google webmaster tools. For MSN webmaster tools please create a rewrite rule for this URL…it should look like “”
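A sketch of such a rewrite rule in .htaccess; the query string Sitemap Creator expects is an assumption, so check your install:

```text
RewriteEngine On
# expose the dynamic sitemap under a static-looking URL for MSN
RewriteRule ^sitemap\.xml$ /sitemap/sitemap.php?do=showsitemap [L]
```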

  • Your script is very cool, but some webmasters may need to increase max_execution_time in php.ini or in an .htaccess file.

  • I’ve been using your SMC a while. Nice work.
    I just realized that SMC cuts a URL where a space appears instead of inserting %20. This breaks my links, and the sitemap points to a number of pages with MySQL errors from the incomplete URL.
    Is there more than the cleanurl() function to correct this?
    Is it about text encoding? I’m using ISO-8859-1.

  • @Peter
    Anchor links should have escaped URIs; that means you should replace any spaces in your links with %20
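A minimal PHP illustration (the URL is made up):

```php
<?php
// A href with a raw space breaks the crawler; escape it before output.
$href = 'http://example.com/my page.html';
echo str_replace(' ', '%20', $href); // http://example.com/my%20page.html
```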

  • Hello. I use your script, and it generates the sitemap, but Google tells me: Unsupported file format
    Your Sitemap does not appear to be in a supported format. Please ensure it meets our Sitemap guidelines and resubmit.

    The address of my sitemap is:

  • The URL sent to the search engines seems to be wrong.
    In my case it is missing the directory sitemap_creator_0.2a/ (at least in the URL returned from MoreOver to my email).
    I have all default settings except SMC_SITE, where $_SERVER[‘HTTP_HOST’] doesn’t work; I replaced it with ‘’

  • Think I got it working. Found the 0.2b and did some reading.

    For editing: is it only the CSV I need to edit?
    What about data/sites/klassiskabyggvaro.se_sitemaps/20090114?

    Thanks for a good script and active blog!

  • @luciffere
    you should submit instead (note that there is no “sitemap” directory in the URL)
    You need to edit the CSV before creating your sitemap, in case you would like to change something

  • Pedro

    Hi, I have installed the script and apparently it works very well (latest version). I have PHP 4.x.

    I have some doubts.
    A. I need to crawl first and after

    B. What time interval for cron is the best option?


  • Pedro

    Another doubt:
    99% of the priorities in my sitemap are 0.1. Do I need to change the priorities manually?


  • Pedro

    I think the lynx -dump command doesn’t work in my cron jobs. Any alternatives?

  • @Pedro
    – I think the cron job could run every week.
    – Priorities depend on your link structure; you might need to check your SEO.
    – You can either install lynx ( yum install lynx ) or use curl.
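A curl-based crontab entry might look like this (URL and schedule are illustrative):

```text
# weekly crawl via curl, discarding output
0 3 * * 0 curl -s "http://example.com/sitemap/sitemap.php?do=createsitemap" > /dev/null
```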

  • Hello again… I want to make a sitemap with your script, but it crawls only 13 links… Many links that are linked from the first pages are listed with a 403 code but they all work… Where is the problem?

  • Rod

    having this error message: No Pages were crawled, Please make sure you have set your site domain correctly and you have valid connection to host

    can you advise me? my config is as below:
    ‘Yahoo’ => ‘’,
    ‘Live Search’ => ‘’,
    ‘’ => ‘’,
    ‘MoreOver’ => ‘’,



  • I found no major problems with your script – installing or otherwise! I was wondering if IE7 defeats the timeout in setcookie (smc_pass)?
    I have altered the time and have yet to see it not show a new login page. I can work around it by closing the browser. Do you have a “logout” feature or must I always close the browser?
    Thanks for a neat script!!

  • @Scooter
    the cookie should be deleted when the browser is closed; try clearing your cookies if you want to log out

  • Harold

    I have successfully used Sitemap Creator 0.2 beta for over 2 months without any problems. Just recently Google can no longer read my sitemap and when I personally visited my sitemap it says

    “XML Parsing Error: no element found
    Line Number 1, Column 1:”

    Any ideas?

  • ZDN

    giving SMC a whirl and noticing some problems and have some suggestions…

    this is on LAMP with PHP v5.2.8 and a website of ~700 pages


    define(‘SMC_USE_BLACKLIST’, true);

    – If set to false, SMC is unable to crawl at all (using http://[domain]/sitemap/sitemap.php?do=crawl). This is repeatable 100%.

    If I change it back to ‘true’, SMC fails to crawl and says “No Pages were crawled, Please make sure you have set your site domain correctly and you have valid connection to host”. I’m a bit foggy here, but I think I had to delete the SMC cache to get it to run again.


    -ability to edit priority and frequency from browser instead of CSV file, and ability to select and edit multiple files at once.

    -ability to make priority and frequency static instead of being changed when re-crawled. ability to set per directory would be good here too.

    -ability to set default priority and frequency for a directory, or a directory and all sub directories, so any new or existing files found in these directories during a crawl or by a future crawl will inherit the same priority/freq.

    -meaningful names for blacklist directory files or, better yet, new link for sitemap.php that shows what was blacklisted

    -default protection for /sitemap/ directory so public can’t access

    i’ve been looking for quite a long time for a sitemap script that can auto-ping and auto-update (CRON) and i like what you have done. looks VERY promising!

  • Hi Karim, thanks a lot for this awesome script. I did the install with only a few small adaptations and it works like a charm. It crawled 7395(!) pages on my site (time limit: 7000, memory: 600) without a hiccup. The sitemap was successfully published to the different search engines as well.

    Only issue right now is the created sitemap only contains 500 pages. 🙁 Checked the config a couple of times, but found nothing. Am I missing something?


  • @ZDN
    thanks for the suggestions, I’ll consider them in new releases
    Can you check the CSV file and let me know how many URLs it has?
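A small helper along these lines counts the rows, assuming the CSV holds one URL per line (the function name and example path are made up):

```php
<?php
// Count how many URL rows a Sitemap Creator CSV holds (one row per URL).
function smc_count_csv_urls(string $path): int
{
    $rows = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    return $rows === false ? 0 : count($rows);
}

// e.g. echo smc_count_csv_urls('data/sites/example.com.csv');
```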

  • Hi, thanks for creating this great sitemap creator. It is the best one I have used. I do have a couple of things, please don’t consider them criticisms they are not.
    1. I have an htaccess redirect on my site so all requests are rewritten. This does cause a problem with the emails. It’s not a real problem, but I guess I am not the only one who uses mod_rewrite; maybe it’s something to consider in future versions?
    2. I have one page that has a space in the page name, and that is caught in the blacklist. I must admit the page name with a space was an error of mine many years ago and I have never got round to fixing it, so it’s my fault with bad page naming.

    So thanks very much for a great sitemap creator.


  • @Blackpool
    thank you for your comments
    1. you will need to add your domain with ‘www’ in the config file
    2. you need to add a redirect header for that URL to a new URL without spaces, and replace the old one on all your pages. Check the Google webmaster guidelines for more info

  • Hi Karim, thanks a lot for this awesome script.
    It required modifying my website, because ./ links caused recursive crawling; I suggest you remove (trim) /./ URLs.
    If a URL is href=”./pagename.html” (which I use to redirect to index.html), the crawler doesn’t add the domain-base URL but links to http://pagename.html
    I hope this helps other people!
    Best regards,

  • cdl

    The script runs great, no problems at all – Thanks for all the work.

    My problem is getting a cron job to run it. My host uses cPanel X. I have tried everything I can think of and either get:
    1. No input file specified.
    2. The complete text of the script.

    Here’s the “Command to run” I’m using:
    php /home/DOMAIN/public_html/sitemap/sitemap.php?do=createsitemap&secure=16312581c5c30b78630b89d2205e8675

    Anyone work this out on cPanel X?

  • @mauro
    I will investigate this more in the next releases.
    You need a program like lynx or cURL to execute the script; it cannot be executed from the PHP CLI.

  • I have a few questions about this script.

    1 – Is there a way to filter some directories or files? (e.g. /directoryname/*)

    2 – I’ve put define('SMC_SITE', $_SERVER['']); but I still get the error saying “No Pages were crawled, Please make sure you have set your site domain correctly and you have valid connection to host”.

  • First: your script is really nice work, thanks for that stuff!

    I had a similar bug like mauro. I solved it with the following patch. Maybe this helps someone. Around line 288:
    /** original **/
    $url = preg_replace('#/[^/]*$#', '/', $url);

    /** replace with this **/
    $url = str_replace("http:", "", $url);
    $url = str_replace("/", "", $url);
    $url = SMC_SCHEME . $url . '/';

    And if you are already coding you may add something to line 252:
    /** original **/
    if( !empty($sub) && preg_match('#\.(ico|png|jpg|gif|css|js)(\?.*)?$#i', $sub) ) #excluding graphics
    /** after change **/
    if( !empty($sub) && preg_match('#\.(ico|png|jpg|gif|css|js|pdf|doc|eps)(\?.*)?$#i', $sub) ) #excluding graphics and other non-HTML documents

    A user-defined filter setting to strip vars like &uid=999 from the URL would be nice in the next release.
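Such a filter could be sketched like this; the function name is an assumption, not part of Sitemap Creator:

```php
<?php
// Strip a volatile query variable (e.g. uid) from a URL before storing it.
function smc_strip_query_var(string $url, string $var): string
{
    // remove "?var=…" or "&var=…" keeping the leading delimiter
    $url = preg_replace('#([?&])' . preg_quote($var, '#') . '=[^&]*#', '$1', $url);
    // tidy any leftover "?&", "&&" and trailing delimiters
    return rtrim(str_replace(['?&', '&&'], ['?', '&'], $url), '?&');
}
```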

  • $url = str_replace('http:', '', $url);
    $url = str_replace('/', '', $url);
    $url = SMC_SCHEME . $url . '/';

  • simon

    this is the best script I have used.
    Thanks for it.

    Could you do a meta/keyword one? That would be good.

    Or does anyone know of one that crawls your site to make them, like this script crawls your site?

  • Is there a way to get all URLs set in JavaScript, like document.location=’http://…’?

  • @yannick
    I am afraid the crawler cannot understand JavaScript.

  • hotep

    Thank you for the excellent script! 🙂

    Does anybody know the best parameters to include when trying to parse phpBB version 2?

    The script just hangs when trying to crawl the bulletin board.

  • Jim

    Every time I attempt to crawl my site with your program I get a “Zero Size Reply”. What is this telling me and how do I fix it?


  • Warning: Division by zero in /var/www/sitemap/ on line 320

    Help please 🙁

    Using lighttpd.


  • Hi, thanks for sharing, I need it…

  • I want to crawl the site but:
    No Pages were crawled, Please make sure you have set your site domain correctly and you have valid connection to host
    The script worked in the past and I haven’t made any modifications. What happened?
    Thank you.

  • Hi dude, I have a question.

    Why, after 44 seconds, does my crawler redirect to a 404 not found page?
    Does anybody have an explanation?


  • Why did the crawler take my sub-domain?
    I only specified the first domain :S

  • So great, I liked it so much, thx!!

  • Terry

    Hi there, I think the script is great – so far so good. I just can’t get it to crawl deeper than the root directory; it will not look into sub-directory folders. Any ideas anyone?

  • Terry

    It’s OK, I found out why sub-folders were not being crawled!
    There have to be links to them first – silly me.

  • Terry

    whether not weather

  • peps

    Thanks for this great script!! Works great!

    How do i add .pdf & .docx to the allowed file types so that they will be added to the sitemap?


  • peps


  • peps

    Still figuring out how… anyone?
    I get this notice:
    NOTICE: Document type is application/pdf for URL