Spidering and Robots

How do search engines work?

Search engines maintain a database of web pages on the internet and the links found on them. Because pages are constantly added, edited, and deleted, search engines must keep updating this database so that each search returns current results. To do so, a program reads a page, follows the links on it to other pages, and repeats the process on every page it reaches. This recursive reading and following of links is called “spidering”, and the program that does it is called a spider or robot.

Most of the time we want search engines to spider our web pages so that they are listed in search results. But there are situations when we do not want a spider or robot to crawl some of our pages, such as admin panel pages or configuration files. Moreover, sometimes your web pages are online but the site is not yet live for visitors, so you would not want robots to index it.
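The read-and-follow loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the three-page site is held in memory, and the `example.test` URLs, the `spider` function, and the `fetch` callback are all made up for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def spider(start_url, fetch):
    """Visit start_url and every page reachable from it, each page once.

    `fetch` is a caller-supplied function returning the HTML for a URL,
    or None when the page is not available.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:       # never queue a page twice
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical site kept in a dictionary instead of being downloaded.
SITE = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/b.html">B</a>',
    "http://example.test/b.html": '<a href="/">Home</a>',
}

print(spider("http://example.test/", SITE.get))
```

The `seen` set is what keeps the spider from looping forever when pages link back to each other, as the home page and `b.html` do here.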

For this purpose there are two options that direct a robot not to index specified pages: robots.txt and robots META tags. Robots.txt is useful when you want to block many pages throughout the site, whereas robots META tags are useful for blocking individual pages.

Note: Not all search engines follow these directives, but most of the reputable ones do.

Robots META Tags

This is a very good method for excluding single web pages. The robots META tags are placed in the <head> section of the web page.

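The META tag directives allow you to stop a single web page from being indexed and to stop the links on it from being followed. There are 4 options based on index and follow:

1) Do not index, do not follow links: <meta name="robots" content="noindex, nofollow">

2) Do not index, follow links: <meta name="robots" content="noindex, follow">

3) Index this page, but do not follow links: <meta name="robots" content="index, nofollow">

4) Index this page, follow links: <meta name="robots" content="index, follow">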
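As a rough illustration of how a robot might read these four options, the sketch below turns the `content` value of a robots META tag into two flags. The function name `parse_robots_meta` is made up for this example, and the sketch assumes that a missing keyword means the permissive default (index and follow).

```python
def parse_robots_meta(content):
    """Return (index, follow) booleans for a robots META content value.

    Both flags default to True, which is assumed here to be how a page
    without any restriction is treated.
    """
    tokens = {token.strip().lower() for token in content.split(",")}
    index = "noindex" not in tokens
    follow = "nofollow" not in tokens
    return index, follow


for value in ("noindex, nofollow", "noindex, follow",
              "index, nofollow", "index, follow"):
    print(value, "->", parse_robots_meta(value))
```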

Robots.txt file

The robots.txt file is the best method when you want to exclude visiting search engine robots from the complete site or from a few sections of it. For example, you might not want a search engine to crawl and index the programs in your admin directories. This method is not particularly suited to excluding single pages (see robots META tags instead).

When a robot visits the site http://www.sakshay.in/blog, it first checks whether the following file exists in the root directory of the website:

http://www.sakshay.in/robots.txt

This specially formatted file tells the spider or robot which parts of the site to exclude. There can be only one robots.txt file per website, and it needs to be placed in the root directory of the website, for example in /public_html. A file placed in any other directory will not be read or followed. For example, a file at:

http://www.sakshay.in/blog/robots.txt

will not be read.
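To make the lookup rule concrete, here is a small sketch of how the robots.txt address can be derived from any page address on a site. The function name `robots_txt_url` is made up for this example; it uses only Python's standard `urllib.parse` module.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the page's own path and query.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.sakshay.in/blog"))
print(robots_txt_url("http://www.sakshay.in/blog/robots.txt"))
```

Both calls give the same root-level address, which is why a copy stored under /blog/ is never consulted.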

Robots.txt offers many options when used correctly. The most commonly used ones are listed below:

a) If you want to allow all robots to spider your site, put the following in your robots.txt file:

User-agent: *
Disallow:

b) If you want to exclude all robots from your site, put the following in your robots.txt file:

User-agent: *
Disallow: /

c) If you want to exclude all robots from particular folders, put the following in your robots.txt file:

User-agent: *
Disallow: /blog/
Disallow: /directory/media/
Disallow: /cgi-bin/

So this will exclude all robots from the URLs http://www.sakshay.in/blog/, http://www.sakshay.in/directory/media/, and http://www.sakshay.in/cgi-bin/.

Note: You need a separate Disallow line for each area that you want to exclude from search engine robots. Also note the trailing slash, which indicates that the entire directory should be ignored by the search engine robots.
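One way to check rules like those in example (c) before uploading them is Python's standard `urllib.robotparser` module. The sketch below feeds the rules in as text rather than downloading them; the robot name `AnyBot` and the page names are made up for the test.

```python
from urllib.robotparser import RobotFileParser

# The rules from example (c), one directive per line.
RULES = """\
User-agent: *
Disallow: /blog/
Disallow: /directory/media/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# "AnyBot" is a hypothetical robot name; it falls under the "*" group.
for url in (
    "http://www.sakshay.in/blog/some-post.html",
    "http://www.sakshay.in/directory/media/logo.png",
    "http://www.sakshay.in/directory/index.html",
    "http://www.sakshay.in/about.html",
):
    print(url, "->", parser.can_fetch("AnyBot", url))
```

The first two URLs are reported as blocked and the last two as allowed, because this parser matches each Disallow value as a prefix of the requested path.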

d) If you want to allow one specific robot, but exclude all others, use the following:

User-agent: FriendlySpider
Disallow:

User-agent: *
Disallow: /

For example, if you want only the Google search engine to crawl your site and want to exclude all other robots, the robots.txt will look like:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
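The effect of per-robot groups can also be checked with Python's standard `urllib.robotparser` module. This is a sketch under the assumption that the robot identifies itself with a name containing `Googlebot`; the name `OtherBot` is made up to stand for any other robot.

```python
from urllib.robotparser import RobotFileParser

# The Googlebot-only rules from above.
RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

page = "http://www.sakshay.in/blog/"
print("Googlebot:", parser.can_fetch("Googlebot", page))
print("OtherBot: ", parser.can_fetch("OtherBot", page))  # hypothetical robot
```

With this parser, a group that names the robot is used in preference to the `*` group, so Googlebot is allowed while every other robot is blocked.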

For more information on robots.txt, please visit http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40360

Web SSL digital certificate

The Internet has grown exponentially over time, but Internet security threats have grown with it. A huge number of hackers are ready to steal your personal details, financial data, and other confidential information.

To combat this, a special Internet protocol called SSL (Secure Sockets Layer) was created. When speaking of viewing web pages over SSL, the term HTTPS is often used.

Many companies, such as Verisign, Thawte, and Godaddy, offer SSL certificates for different requirements. Sakshay is one of the resellers of these SSL certificates. Standard web server SSL certificates range from Rs.1000/- to Rs.20,000/- per year for a single domain. For more information about Thawte or Verisign certificates and pricing, please visit: http://www.sakshay.in/services_SSL_certificate.html.

301 redirect (permanent redirection)

If you are planning to redirect a page of your site, or the whole site, to other URLs, a 301 redirect is the method of choice because it is easy to implement and very search engine friendly. Below are a few methods for setting up a 301 redirect on Linux servers using .htaccess. The precondition is that the Apache mod_rewrite module is enabled.

Redirect to www (htaccess redirect)

This method applies if you want all requests coming in to domain.com to be redirected to www.domain.com. For example, if the site www.sakshay.in is accessible at both http://sakshay.in and http://www.sakshay.in, a search engine will see these as two separate URLs and may mark one of them as duplicate content. This method redirects http://sakshay.in and all of its pages to http://www.sakshay.in.

The .htaccess file needs to be placed in the root directory of your website (i.e. the same directory where your index file is placed).

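Options +FollowSymLinks
RewriteEngine On
# Apply only when the request came in on the bare domain (no www)
RewriteCond %{HTTP_HOST} ^domainname\.com$ [NC]
# Send a permanent (301) redirect to the same path on the www host
RewriteRule ^(.*)$ http://www.domainname.com/$1 [R=301,L]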

Please REPLACE domainname.com and www.domainname.com with your actual domain name.

Again, this method requires that the Apache mod_rewrite module is enabled on the server for the domain.
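To see what the rule does to an incoming request, the following Python sketch imitates it. It is a simplified model, not Apache itself: the function name `redirect_to_www` is made up, and the host test is assumed to be an exact, case-insensitive comparison with the bare domain.

```python
import re


def redirect_to_www(host, path, bare_domain="domainname.com"):
    """Return (301, target_url) when the rule would fire, otherwise None."""
    # Condition: the request arrived on the bare domain, ignoring case.
    if not re.fullmatch(re.escape(bare_domain), host, flags=re.IGNORECASE):
        return None
    # In .htaccess the pattern sees the path without its leading slash.
    relative = path[1:] if path.startswith("/") else path
    captured = re.match(r"^(.*)$", relative).group(1)  # this is $1
    return 301, f"http://www.{bare_domain}/{captured}"


print(redirect_to_www("domainname.com", "/about.html"))
print(redirect_to_www("www.domainname.com", "/about.html"))
```

The first call yields a 301 to the www address with the same path; the second returns None because the request is already on the www host, which is what prevents a redirect loop.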

Redirect Old domain to New domain (htaccess redirect)

This method applies if you want all the directories and pages of your old domain to be correctly redirected to your new domain. It also prevents search engines from marking the pages as duplicate content.

The .htaccess file needs to be placed in the root directory of your old website (i.e. the same directory where your index file is placed).

Options +FollowSymLinks
RewriteEngine on
RewriteRule (.*) http://www.newdomainname.com/$1 [R=301,L]

Please REPLACE www.newdomainname.com in the above code with your actual domain name.

This takes care of redirecting all the old links to the new URLs. Even so, it is preferable that your backlinks from other sites are also updated to point directly to the new URLs.

Again, this method requires that the Apache mod_rewrite module is enabled on the server for the domain.
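When asking other sites to update their backlinks, it helps to have a list of old and new addresses side by side. The sketch below builds such a mapping in Python. The function name `new_domain_url` and the host `olddomainname.com` are made up for the example, and the sketch assumes the query string is carried over unchanged, which is mod_rewrite's default when the target contains no query string of its own.

```python
from urllib.parse import urlsplit, urlunsplit


def new_domain_url(old_url, new_host="www.newdomainname.com"):
    """Return the address an old-domain URL is redirected to."""
    parts = urlsplit(old_url)
    # Same path and query string, new host; an empty path becomes "/".
    return urlunsplit(("http", new_host, parts.path or "/", parts.query, ""))


# Hypothetical old addresses that other sites may still link to.
old_links = [
    "http://olddomainname.com/",
    "http://olddomainname.com/services/ssl.html",
    "http://olddomainname.com/blog/index.php?p=12",
]
for link in old_links:
    print(link, "->", new_domain_url(link))
```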

Welcome to the Official Sakshay Blog!

Recognized as one of the best software employers in our region, we offer PHP MySQL development services delivered by the best professionals in this field. We are more than a company: we are a highly motivated and devoted team of talented software engineers, consultants, and managers who are competent and knowledgeable in custom programming and customer relationship management.

Sakshay is a technology-driven web application development and design firm. We are privately held and headquartered in India. Our programming expertise, prompt service, and strong client support, combined with competitive pricing, have earned us growing recognition in the field of e-business in India, the USA, Europe, and Canada.

Our core areas of specialization are PHP, MySQL, and XML for web application development. Our open source applications are a tremendous learning resource for ColdFusion and PHP MySQL developers. These products are complete, functional applications that you can adapt and use as you like. Each application demonstrates a particular set of techniques that you can study and adapt to your own purposes.

Sakshay aims to provide the right strategies and value-added services to help our clients strengthen their presence online. We cater to their need to succeed in the networked economy. We do this by closely examining their existing systems and introducing solutions that make them more profitable. At Sakshay, we commit our time and resources to providing the most important solutions, those that meet your requirements.