Site Construction Planning

[Image: Utilitiesman 2nd Class Adam Townsend, assigned to Naval Mobile Construction Battalion Five (NMCB 5), Detail Sasebo, surveys a construction site.]

Having decided on the website software (Drupal) and chosen a web hosting provider (A2Hosting), it’s time to do some detailed planning for how to lay down the software on the hosting framework. There are a number of options. Since I’m using a shared hosting solution, the first decision is whether or not to have the Drupal software installed and managed using A2Hosting’s auto-installer. That option would give me an easy “one click” installation process, but it also takes detailed control of the configuration out of my hands, so that isn’t the option I want. I want to follow a route that gives me more control over how the site will be configured. Of course, that also means I have to take more responsibility for getting the job done.

Requirements

One of my key criteria in the configuration of the hosting environment for architectedfutures.net is how I expect the platform to be initially used. The reality is that I am starting out with a website where I hope to attract visitors, but which, in essence, is really a place for me to begin to assemble and document my thoughts about my work in a publicly visible, transparent way. The site will be a place for me to put my documentation and notes about my thinking: a place where those notes can be viewed and commented on by others, but which may not have a large initial audience. These were the same criteria that told me it was okay to use a shared hosting environment rather than a VPS or a more advanced hosting arrangement. The same line of thinking leads me to treat my hosting platform as potentially multi-functional, not something that needs to be fine-tuned for a dedicated, high-performance application. And that influences how I want to configure the site.

From an architectural perspective what I am dealing with is two things, and it is important to understand the difference between them.

  1. architectedfutures.net is a soft entity. From a practical basis it is the information and visualization that appears on a web browser when anyone visits the site. This is independent of the hardware and software platform that may have generated that information.
  2. The A2Hosting web hosting facility is a hard entity. It consists of an actual set of hardware and software configured in a particular way to generate the architectedfutures.net HTML stream.

The first and key requirement for the configuration of the hosting facility is how the architectedfutures.net web presence and the hosting platform come together. architectedfutures.net needs to appear to the user as a complete and independent entity with a clean and exclusive identity. This means that the content should not appear as a sub domain, nor should it appear under a visible sub-directory. Any web link or browser reference to “http://www.architectedfutures.net” or to “architectedfutures.net” should directly reference and engage the primary content of the site as delivered by the Drupal software, as though that were the only content on the site. And, as the needs of the architectedfutures.net website change over time, perhaps to the point of needing a more robust platform, that change should be accommodated transparently. I should be able to separate the two things by moving architectedfutures.net to some other facility as its needs change, and the change in hosting platform should not be visible to the users of the site. Changes in hosting platform should also be transparent to any search engine database entries that may have accumulated by that point. Given that as primary, I then have some extra requirements for how I want the hosting facility configured:

  • For purposes of being able to manage site updates, Drupal should in fact be implemented from a sub-directory. This would allow multiple independent Drupal installs to be implemented on the same facility. architectedfutures.net should direct to the “current” or “primary” or “production” implementation. But development.architectedfutures.net, for example, should access a development or test version of the site hosted on the same web hosting facility. (The reason for this is to be able to test changes to the site, on the site, before the change becomes effective as the default code for public access. Only persons who know the test access path will have access to the test code. The test code should run against a test database, but otherwise be as representative as possible of the production environment. Changes may include styling of new content, changes in presentation, module upgrades, the introduction of new modules, or minor or major upgrades to the Drupal core code.)
  • The primary site should support the implementation of SSL for at least some functions, but the development site(s) may not need or use SSL.
  • I want to have the option of hosting other applications on the site if desired. These may include such simple things as raw HTML or simple PHP code, or some other application like a wiki or a polling system. The extra applications need not be SSL protected. These applications would probably be related in some fashion to architectedfutures.net and could either be addressed as subdomains or as functional sub-areas (sub-directories) of the main site.
  • I want to be able to simply and quickly “nullify” the site at any point, if needed, by replacing the site with a fixed HTML page.
  • I want a fallback, contingency option in the event that Drupal fails to satisfy my needs or requirements. I want to be able to redevelop the site using some other software suite on the facility, and then have a simple cut-over procedure to “flip” the site when it is ready, if that were to become necessary.

Solution

Preamble

Before getting into the details of the solution, there is one item that should be mentioned. When I started to research how to carry out a solution to my requirements, a lot of what I found on the internet was discussion about how to create a Drupal “multi-site” configuration. This is NOT a Drupal “multi-site” configuration. Drupal multi-site is designed to drive multiple Drupal sites from a single code base. It’s created for “sharing” the code. It’s designed to minimize the effort required to keep the Drupal code base for multiple websites up to date and synchronized. The whole idea with Drupal multi-site is that multiple sites are all using the same versions of the same Drupal code. The whole idea here is to NOT share the code, but to have multiple sites, each with its own code. Updating one set of code should have no effect on the other site. This is necessary to run different versions of Drupal, different versions of modules, or different module configurations side by side and independently. One instance can then be used as a staging or testing area without impacting what is operating in the other area. The staging “sub domain” can be the last test stage for a configuration change before the main site is modified. This way the changes actually get tested in the same hardware and software environment as the public site before the change is installed on the primary site.

Overview

The general solution to my requirements is to sub-divide my web hosting resources and to place each distinct version of my Drupal software into its own directory space. The Drupal software will be uploaded and installed multiple times, once in each directory. This increases my maintenance effort, but it will allow me to run multiple different versions of the Drupal code from the same resources. Each installation will be completely and independently configurable. And, because my hosting plan provides for multiple databases, I am going to use a separate database for each Drupal install. The only new requirement this places on any other applications which I may want to run from the same environment is that they must be able to be installed and operated from their own independent sub-directory.

The main document directory for my hosting resource, the document root, becomes my point of control for any global actions I want to take across the entire set of resources.
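As a rough sketch of what such a global action might look like, the “nullify” requirement could be handled by temporarily placing a few lines at the top of the document root .htaccess file. This assumes a static page named maintenance.html (a hypothetical name) has been uploaded to the document root. It is an illustration under those assumptions, not the configuration I actually run.

```apache
# Temporary "nullify" block: serve one fixed page for every request
# that is processed by this document root .htaccess file.
# ASSUMPTION: maintenance.html is a hypothetical static page placed
# in the document root for this purpose.
RewriteEngine on
# Leave the holding page itself alone so the rule cannot loop
RewriteCond %{REQUEST_URI} !^/maintenance\.html$
# Internally rewrite everything else to the holding page
RewriteRule ^ /maintenance.html [L]
```

Removing the block restores normal processing. Sub-directories whose own .htaccess files define rewrite rules do not inherit these rules by default, so any sub domain sites would likely need the same treatment if they also had to be taken down.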

Addressing

The issue of addressing has two perspectives:

  1. What does a user need to type (technically, how does a URI need to be written) to get the data or execute the interaction which the user desires?
  2. What assumptions do search engines and related tools make based on the way URIs for the site are configured?

As soon as I started playing with sub-directories I began down a road where the implications of URI addresses needed to be confronted. (Actually the issue existed before that, but it was a lot less complex.) My understanding is that search engines will treat each domain or sub domain as a separate entity. Sub-directories within a domain are all generally considered to be the same site. I definitely do not want the search engines to assume that everything in my hosting environment is part of my architectedfutures.net website. I especially don’t want the development/staging site to be confused with the production site. And I don’t want users of architectedfutures.net to need to include a sub-directory name as part of every URI for the site. I may want some “add-on” facilities to seem to be part of the main website, but I probably want others to appear to both users and search engines as distinct and separate. This is all about addressing.

So I have a couple of challenges:

  • I want to eliminate the need for the knowledge or use of the sub-directory label as part of the main architectedfutures.net website, and
  • I want to “flag” the content of some sub-directories as being distinct from the main site.

For the components of the site that I want to be viewed as separate, such as the development/staging area(s), I’m going to use subdomains. These will take the form of names like development.architectedfutures.net, test1.architectedfutures.net, etc. Each of these will be given a separate sub-directory under the document root, and for each one I’ll use my hosting provider’s cPanel facility to create a unique sub domain name pointing to the proper sub-directory.

Access Management

This is where the core of the solution happens, in the .htaccess file. Since I’m dealing with a shared hosting environment, I don’t have access to the Apache configuration file for my website. But the .htaccess file allows me to do the specific access management configuration which I need to do. Since the .htaccess code can get confusing, especially the rewrite conditions and rules, I’ll discuss the configuration elements in sections to describe how I’m going about getting the results I want.

For my configuration I am actually making adjustments to more than one .htaccess file. These are discussed in sections below.

Document Root .htaccess File

The following constitutes the adjustments for the .htaccess file located in the document root of the web server. In my case, at A2Hosting, this is in the public_html folder. Directives applied here will also apply to any sub-directories unless they are overruled by an .htaccess file in a lower level directory.

# Turn off indexing so as not to allow browsing of file directories
Options -Indexes
# Configure for redirection
RewriteEngine on
Options +FollowSymLinks

These first few lines in .htaccess can be considered preamble. The lines beginning with ‘#’ are comments. These lines can be eliminated, but I like to leave in the comments to remember why I did certain pieces. The “Options -Indexes” line tells the Apache web server to not generate indexes for web folders which do not have an “index” file. I consider this a good security policy. The “RewriteEngine on” line tells Apache to activate the rewrite engine, the facility that enables rewrite rules. And the last line tells Apache to follow symbolic links. This enables processing for the instructions which come below.

# force use of canonical addressing
RewriteCond %{HTTP_HOST} ^www\.architectedfutures\.net$ [NC]
RewriteRule (.*)$ http://architectedfutures.net/$1 [L,R=301]

These next lines eliminate issues related to search engines or other spiders thinking that http://www.architectedfutures.net and architectedfutures.net might be two websites with duplicate content. Any references to http://www.architectedfutures.net are redirected to architectedfutures.net (without the ‘www’ sub domain). The specification is provided in the form of two statements, a rewrite condition (RewriteCond) followed by a rewrite rule (RewriteRule). Multiple conditions may be joined and applied to a rule, but in this case we are only using one: requests whose HTTP_HOST is ‘www.architectedfutures.net’ are redirected to the same URI without the ‘www.’ portion. The ‘[NC]’ flag at the end of the rewrite condition makes the test case insensitive (upper and lower case differences are ignored). The ‘R=301’ flag on the rewrite rule forces an external redirect through the requesting agent (the browser or search engine) by answering the initial request with HTTP status code 301 (Moved Permanently). This tells the agent to reissue the request to the adjusted name and informs it that the ‘www.’ named site has permanently moved to the name without the ‘www.’ sub domain. The ‘L’ flag marks this as the last rule to apply: no further rewrite rules are processed for this request. With this instruction in place, anyone who references our site using a ‘www.’ sub domain should see the ‘www.’ portion removed from the name in the address bar of their browser. Ideally their bookmarks and other references will eventually be updated to drop the ‘www.’ prefix as well. And search engines should recognize that both names are equivalent and that there is only one site behind both names.
(A slightly different form of these instructions could be used if I wanted to standardize on always using the ‘www’ prefix for the site, but my choice is no ‘www’ prefix.)
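For comparison, a minimal sketch of that alternative form might look like the following. It is shown only to illustrate the idea and is not part of my configuration.

```apache
# Alternative (not used on this site): force the 'www' prefix
# Match requests addressed to the bare domain name...
RewriteCond %{HTTP_HOST} ^architectedfutures\.net$ [NC]
# ...and permanently redirect them to the 'www.' form of the same URI
RewriteRule ^(.*)$ http://www.architectedfutures.net/$1 [L,R=301]
```

If this form were used, the HTTP_HOST conditions in the routing rules further down would also need to match the ‘www.’ form of the name instead of the bare domain.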

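Next come three sets of rewrite conditions and rules which route requests for the primary web domain. These are the requests I want to have processed by the primary Drupal installation in the ‘drupal’ sub-directory. Unlike the canonical addressing rule above, these are internal rewrites: no redirect is sent back to the browser, so the rewrite itself does not expose the sub-directory name to the user.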

# reroute requests for af.net to Drupal
RewriteCond %{HTTP_HOST} ^architectedfutures\.net$ [NC]
RewriteRule ^$ drupal/index.php [L]

This first rule handles requests for the bare domain, ‘architectedfutures.net’ with an empty path, by rewriting them to the ‘drupal’ sub-directory for processing by index.php (i.e., Drupal).

# reroute requests for files to the Drupal directory routing
RewriteCond %{HTTP_HOST} ^architectedfutures\.net$ [NC]
RewriteCond %{DOCUMENT_ROOT}/drupal%{REQUEST_URI} -f
RewriteRule .* drupal/$0 [L]

This next set of conditions adjusts the addressing of all domain relative file requests which reference actual files. They are adjusted to get the file from under the ‘drupal’ sub-directory.

# reroute non-directory, non-file requests to Drupal
RewriteCond %{HTTP_HOST} ^architectedfutures\.net$ [NC]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* drupal/index.php?q=$0 [QSA,L]

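This third rewrite specification handles all requests that do not correspond to an actual file or directory. They are rewritten to be processed by Drupal (index.php) in the production Drupal sub-directory, with the requested path passed in the ‘q’ query parameter. The ‘QSA’ flag appends any query string from the original request.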

# redirect sub domain directories to their canonical URIs
RedirectMatch 301 ^/sdd/(.*)$ http://sdn.architectedfutures.net/$1

This last directive is really a model and not a real directive in my .htaccess file. I need multiple copies of this RedirectMatch directive in the file, one for each sub domain on my site.

The directive is similar to the canonical addressing rewrite rule identified above for the main website. However, rather than redirecting www-prefixed URLs, this directive redirects all attempts to access the sub-directories which have been set up to support sub domains, forcing access through the sub domain names. The sub-directory folders are being blocked from access as parts of the primary domain website. Instead, they must be accessed via the sub domain URL. One such line is supplied for each sub domain, where the ‘sdd’ literal in the RedirectMatch line identifies the name of the folder or sub-directory, and the ‘sdn’ portion of the directive defines the sub domain name assigned to the independent website. For example, ‘develop’ might be a directory containing a testing and development version of Drupal addressed as development.architectedfutures.net, where ‘development’ is the sub domain name. This allows multiple development sites to be hosted on the same hosting facility depending on my needs. I could also create another instance in a directory called ‘d8’ to support advanced testing of Drupal V8 if I so desired. Or I could create different versions to test different module combinations. Each of these would get its own RedirectMatch line and each would be established as a sub domain using the cPanel facilities.
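Filling in the model with the example names from the paragraph above gives something like the following sketch. The ‘develop’ directory and ‘development’ sub domain come from the example; the ‘d8’ sub domain name is a hypothetical choice made only for illustration.

```apache
# One RedirectMatch per sub domain: directory name on the left,
# sub domain name on the right.
# 'develop' directory -> development.architectedfutures.net
RedirectMatch 301 ^/develop/(.*)$ http://development.architectedfutures.net/$1
# 'd8' directory -> d8.architectedfutures.net (hypothetical sub domain name)
RedirectMatch 301 ^/d8/(.*)$ http://d8.architectedfutures.net/$1
```

A request for architectedfutures.net/develop/node/1 would then be answered with a permanent redirect to development.architectedfutures.net/node/1.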

Primary Domain (/drupal) .htaccess File

Drupal includes an .htaccess file in the root of each directory set where Drupal is installed. Some of the reference resources I have supplied below suggest modifying the Drupal supplied file to eliminate the rewrite directive processing. On review of the file as supplied with Drupal 7 I have not done this. The rewrite directives in the file are in addition to the ones I have identified above, and they appear to make sense and work fine if left alone. I have found no need to adjust the file in the primary Drupal installation.

Sub Domain .htaccess File

For my development sub domain Drupal also creates an .htaccess file (one for each install). Again, this file works fine and supplies Drupal’s standard desired attributes for a Drupal installation. However, for my development directories I did make a change. The file supplies two sets of directives, both commented out, that can be used to force canonical names for the website. I’ve uncommented one set of these directives to standardize my sub domain names. As with the main domain name, my practice is to disallow the ‘www’ prefix on these sub domains. This change is made in the .htaccess file in each Drupal sub domain.
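The exact wording of the directives in the shipped file varies between Drupal releases, so the following should be read as a sketch of what the uncommented “without www” set looks like in general form, not as a verbatim copy of the file.

```apache
# Redirect www.<host> to <host>, whatever the host name is.
# The condition captures everything after the 'www.' prefix...
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
# ...and the rule reuses that capture (%1) as the new host name,
# keeping the original request path.
RewriteRule ^ http://%1%{REQUEST_URI} [L,R=301]
```

Because the host name is captured and not hard-coded, the same lines can be used unchanged in each sub domain directory.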

settings.php

When installing Drupal you create a file named settings.php which you create by copying a file named default.settings.php. Both files are located in the /sites/default directory. The file is Drupal’s site-specific configuration file. Most of the content for this file is created by the Drupal install program based on information you give when you run the install process. I have found it necessary to change one entry in the file after the install process has completed, but only for the primary site.

The file has a commented entry for the base_url variable. It looks like this:

# $base_url = 'http://www.example.com'; // NO trailing slash!

In the settings.php file for the primary site this entry needs to be uncommented and adjusted to reflect the canonical name for the website. In my case, this looks like the following:

$base_url = 'http://architectedfutures.net'; // NO trailing slash!

Without this change I have found that sometimes after executing some Drupal process the address bar of my browser will refer to the home page of my site with an address that includes the directory in which I have Drupal installed. With this change in place, that does not occur.

Again, I have only found this setting required on the primary Drupal install, not on the development install that is accessed through a sub domain. I suspect the reason is that the sub domain directories all have redirect directives which issue 301 redirects when they are addressed this way, but there is no such directive when the primary domain is addressed this way. (Attempting to create such a directive tends to generate an infinite rewrite processing loop.) In any case, this eliminates the confusion on the browser address bar for the client.

Robots.txt

The robots.txt file is a special file which resides at the root of a domain to keep search engines and other web crawlers from indexing or accessing those parts of the overall site which we do not want to be available as search engine content. Some documentation on the web implies that these files need to be in the document root directory. The authors then offer various .htaccess rules to manipulate a set of robots.txt files to serve up the correct version of robots.txt depending on which sub domain is being accessed. In my research, including some experimentation with the Google Webmaster Tools, I have NOT found this to be necessary. The robots.txt file simply needs to be reachable as a file addressed directly under the domain name, as though it were in the root of that domain. For example, as in http://www.example.com/robots.txt, or http://www.sub1.example.com/robots.txt. What this means is that it is possible to avoid special .htaccess rules if the robots.txt files are simply placed appropriately on the site. In effect, the robots.txt files are no different from the Drupal index.php files for each “domain” we are supporting. The robots.txt file for a domain simply needs to be in the relative root for the domain. (It needs to be in the sub-directory of the hosting site which serves as the root for that domain.) My robots.txt for architectedfutures.net goes in the public_html/drupal sub-directory, and my robots.txt for development.architectedfutures.net goes in the public_html/develop sub-directory.

Primary Domain robots.txt Content

The robots.txt file for my primary domain came as part of my Drupal install package. For the most part, I want to leave this alone. It accurately reflects how I want my website viewed by the search engines. However, I want to add some specifications.

The Drupal supplied robots.txt file assumes that Drupal was installed at the document root of the website; in my case that would be the public_html directory. However, my installation is actually in the public_html/drupal directory. The Drupal robots.txt file provides appropriate search engine instructions for everything under the public_html/drupal directory (addressed as architectedfutures.net/something), but it fails to specify anything about the other directories under public_html which may also be addressed as directories under architectedfutures.net. So I want to add entries to the robots.txt file in public_html/drupal to disallow searching in these other directories. The following lines do this:

# For all agents
User-agent: *
# Disallow public_html directories
Disallow: /drupal/
Disallow: /develop/
Disallow: /test1/

For the most part these lines should not be needed. References to these directories should not generally exist outside of my website. However, if any do exist, these lines will keep search engines from using those references to rediscover any areas of my site through non-standard access paths.

Sub Domain robots.txt Content

My development and/or test sub domains which are Drupal installations also include robots.txt files. However, in these cases I don’t really want the sites indexed and documented in search engines. So, for these domains I need to replace the robots.txt files which came as part of the Drupal install package with a new robots.txt file which disallows all access. The following robots.txt replacement file does this task:

# robots.txt
# For all robots
User-agent: *
# Disallow all crawling of the site
Disallow: /

A copy of this file replaces any other supplied robots.txt file and goes in the main sub-directory supporting each of my development and testing sub domains.

Final Note

One last note about site configuration. As a general rule I do not believe in security by obscurity. However, it does make sense as a defense in depth measure. I have read multiple tutorials like this one that discuss using a ‘drupal’ directory as the name for your primary Drupal install and a ‘develop’ directory as the name for a development or staging area, just as I have done here. In actual practice these are NOT good names for the associated directories, nor is it a good idea to publish your real directory names on a public website. One of the benefits of moving your website deeper into your hosting facility is to make it harder to hack the site. If you are going to take the trouble to do that, don’t use names that are easy to guess and don’t publish your directory names. The names supplied above are not the names I use on my site. Don’t use them on your site either!

Resources

The following links identify some of the resources which I found useful in this stage of planning the website.

Web Links

  • drupalsn.com – how to configure Drupal subdomains
  • drupal.org – installing Drupal somewhere other than the root document directory on your web server
  • drupalscout.com – alternative methods for providing SSL privacy protection on a Drupal website
  • Apache documentation – mod_rewrite introduction
  • robotstxt.org – information about robots.txt files