Thursday, 12 July 2012

Regular Expression for .htaccess

Regular expressions are patterns, written in a specific syntax, that the server can understand and use to match and process strings automatically.

The concept was introduced and formally defined by the American mathematician Stephen Kleene.

In .htaccess, regular expressions are mainly used in RewriteRule to manipulate URLs.

Before going deeper, first look at some definitions:

literal - A literal is any character that is used as itself in a searching or matching expression. For example, to find ind in windows or india, ind is a literal string: each character plays a part in the search, and it is literally the string we want to find.
metacharacter - A metacharacter is one or more special characters that have a unique meaning and are NOT used as literals in the search expression. For example, the character ^ (circumflex or caret) is a metacharacter.
target string - This term describes the string that we will be searching, that is, the string in which we want to find our match or search pattern.
search expression - Most commonly called the regular expression. This term describes the expression that we will be using to search our target string, that is, the pattern we use to find what we want.
escape sequence - An escape sequence is a way of indicating that we want to use one of our metacharacters as a literal. In a regular expression, an escape sequence involves placing the metacharacter \ (backslash) in front of the metacharacter that we want to use as a literal. For example, if we want to find (s) in the target string window(s), we use the search expression \(s\). If we want to find \\file in the target string c:\\file, we need the search expression \\\\file: each of the 2 backslashes we want to match as a literal is preceded by an escaping \.
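
The two escaping examples above can be tried out with a short script. This is only a sketch: it uses Python's re module rather than Apache, on the assumption that backslash escaping behaves the same way for these simple patterns. The variable names are made up for illustration.

```python
import re

# Raw strings (r"...") stop Python from interpreting the backslashes itself,
# so the regular expression engine sees exactly what is typed here.

# \( and \) match literal parentheses instead of opening a group.
paren_match = re.search(r"\(s\)", "window(s)")
print(paren_match.group(0))

# Without the escapes, (s) is a group that simply matches the letter s.
unescaped_match = re.search(r"(s)", "window(s)")
print(unescaped_match.group(0))

# The target c:\\file contains two literal backslashes.
# Each one needs its own escaping backslash, giving four in the pattern.
backslash_match = re.search(r"\\\\file", r"c:\\file")
print(backslash_match.group(0))
```

The first search returns the text with its parentheses, while the unescaped version only finds the single letter inside them.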

Now look at some symbols that have a specific meaning:

^ Denotes the beginning of the string. Whatever follows it must match at the very start of the string.
ex - "^a" matches a string starting with "a"
$ Denotes the end of the string. It is called the terminating symbol.
ex - "ab$" matches a string ending with "ab", meaning no characters may follow the "ab".
* Denotes zero or more occurrences of the preceding symbol.
ex - "ab*" matches a, ab, abb, abbb, ... and ".*" works as a wildcard for any sequence of characters
+ Denotes one or more occurrences of the preceding symbol.
ex - "ab+" matches ab, abb, abbb, ...
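\ Denotes the escape symbol. It makes the metacharacter that follows it a literal, as in "\$", "\+" or "\.".
\. Denotes a literal ".". It is used to match the "." character itself.
! Denotes negation. It is not a regular expression symbol itself: mod_rewrite treats a "!" in front of a pattern as "the pattern does not match".
ex - "!ab" succeeds for every string that does not contain ab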
? Denotes optional, that is, zero or one occurrence of the preceding symbol or group.
ex - "ab?" matches "a" or "ab" and "a(ab)?" matches "a" or "aab"
{} Denotes the minimum and/or maximum number of occurrences of the preceding symbol.
ex - "ab{3,5}" matches the strings abbb, abbbb, abbbbb
        "a{3}" matches the literal "a" exactly three times, so it matches the string "aaa"
        "a{3,}" matches the literal "a" at least 3 times, so it matches "aaa", "aaaa", "aaaaa", ...
() Denotes grouping. Used to group symbols/literals so that they are treated as a single unit.
ex - "a(ab)*" matches "a", "aab", "aabab", ...
[] Denotes a character class. Matches any single character listed within the brackets.
ex - [abc] matches "a" or "b" or "c"
[a-z] Here "-" denotes the range from a to z, so this matches any lowercase letter.
Similarly, [a-zA-Z] matches any lowercase or uppercase letter
[0-9] matches any digit from 0 to 9
| Denotes the pipe, a logical OR. Used to choose between alternatives.
ex - "(a|b)" matches "a" or "b"
. Denotes any single character. It is the wildcard character.
ex - ".*" matches any sequence of characters, including an empty one
- Denotes a range inside square brackets.
ex - "[0-9]" matches any single character between 0 and 9
^$ Denotes the empty string: the start is immediately followed by the end.
\s Denotes a whitespace character
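-d Tests whether the string is an existing directory (a RewriteCond test, not a regular expression)
-f Tests whether the string is an existing file (a RewriteCond test)
-s Tests whether the string is an existing file with a size greater than zero (a RewriteCond test)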
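
The regular expression symbols listed above can be checked with a small script. This is a rough sketch, not Apache itself: it uses Python's re module, on the assumption that these basic constructs behave the same way as in the regular expressions Apache uses. The helper name matches and the sample strings are made up for illustration. The "!" prefix and the -d, -f, -s tests belong to mod_rewrite and are not covered here.

```python
import re

def matches(pattern, text):
    """Return True if pattern is found anywhere in text; anchoring is left to ^ and $."""
    return re.search(pattern, text) is not None

# (pattern, target string, whether a match is expected)
checks = [
    (r"^a", "apple", True),           # ^ : starts with a
    (r"^a", "banana", False),
    (r"ab$", "cab", True),            # $ : ends with ab
    (r"ab$", "abc", False),
    (r"^ab*$", "a", True),            # * : zero or more b
    (r"^ab*$", "abbb", True),
    (r"^ab+$", "a", False),           # + : at least one b
    (r"^ab+$", "abb", True),
    (r"^ab?$", "ab", True),           # ? : optional b
    (r"^ab?$", "abb", False),
    (r"^a(ab)?$", "aab", True),       # ? applied to a group
    (r"^ab{3,5}$", "abb", False),     # {3,5} : three to five b
    (r"^ab{3,5}$", "abbbb", True),
    (r"^a{3}$", "aaa", True),         # {3} : exactly three a
    (r"^a{3,}$", "aaaaa", True),      # {3,} : three or more a
    (r"^a(ab)*$", "aabab", True),     # () with * : repeated group
    (r"^[abc]$", "b", True),          # [] : character class
    (r"^[a-z]+$", "India", False),    # lowercase only
    (r"^[a-zA-Z]+$", "India", True),
    (r"^[0-9]+$", "2012", True),
    (r"^(a|b)$", "b", True),          # | : logical OR
    (r"^.$", "x", True),              # . : exactly one character
    (r"^.$", "xy", False),
    (r"\.", "index.html", True),      # \. : a literal dot
    (r"\.", "index", False),
    (r"^$", "", True),                # ^$ : empty string
    (r"\s", "a b", True),             # \s : whitespace
]

for pattern, text, expected in checks:
    status = "ok" if matches(pattern, text) == expected else "MISMATCH"
    print(f"{status}: {pattern!r} against {text!r}")
```

Most of the patterns are wrapped in ^ and $ so that the whole target string has to match; without the anchors, a pattern such as "ab+" would also be found inside a longer string.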
Flags are usually added at the end of a rewrite rule, in square brackets, to tell the Apache server how to interpret and handle the rule.

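[C] Chain - Instructs the server to chain this rule with the next one. If this rule does not match, the chained rules that follow it are skipped.
[F] Forbidden - Sends a 403 Forbidden header to the user.
[G] Gone - Sends a 410 Gone status, meaning the resource no longer exists.
[H=handler] Handler - Instructs the server to process the request with the specified handler.
[L] Last - Denotes the last rule and instructs the server to stop processing further rewrite rules once this rule has matched.
[N] Next - Restarts the rewriting process from the first rule, using the URL as rewritten so far. Use it with care, since it can loop.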
[P] Proxy - Instructs the server to handle the request with mod_proxy, i.e., Apache should fetch the remote content specified in the substitution section and return it.
[R] Redirect - Sends an external redirect to the rewritten URL. The status is 302 by default and can be set with [R=code], e.g. [R=301].
[CO] Cookie - Sets the specified cookie.
[NC] No Case - Makes the pattern match case insensitive.
[NE] No Escape - Instructs the server not to escape special characters in the output.
[NS] No Subrequest - Ignores this rule if the request is a subrequest.
[OR] Logical OR - Used on RewriteCond lines. Ties two conditions together so that either one proving true will cause the associated rule to be applied.
[PT] Pass Through - Instructs mod_rewrite to pass the rewritten URL back to Apache for further processing.
Use it when processing URLs with additional handlers, e.g., mod_alias.
[QSA] Query String Append - Appends the original query string to the one in the substitution URL instead of discarding it.
[S=x] Skip - Instructs the server to skip the next "x" rules if a match is detected.
[E=variable:value] Environment Variable - Instructs the server to set the environment variable "variable" to "value".
[T=MIME-type] MIME Type - Forces the specified MIME type.
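
The effect of [NC] and [OR] can be imitated in a few lines. This is only an analogy in Python, not how Apache implements the flags: re.IGNORECASE stands in for [NC], and any() versus all() stands in for conditions joined with and without [OR]. The pattern, the request path and the host names are hypothetical.

```python
import re

# [NC]: the same pattern, tried with and without case sensitivity.
pattern = r"^images/(.+)\.jpg$"   # hypothetical rule pattern
request = "Images/Logo.JPG"       # hypothetical request path

without_nc = re.search(pattern, request) is not None
with_nc = re.search(pattern, request, re.IGNORECASE) is not None
print("without [NC]:", without_nc)
print("with [NC]:", with_nc)

# [OR]: conditions are normally ANDed together, so every one must match.
# With [OR] between them, one matching condition is enough.
host = "www.example.com"          # hypothetical host name
conditions = [r"^example\.com$", r"^www\.example\.com$"]

applied_with_and = all(re.search(c, host) for c in conditions)
applied_with_or = any(re.search(c, host) for c in conditions)
print("conditions ANDed:", applied_with_and)
print("conditions ORed:", applied_with_or)
```

Here the request only matches once case is ignored, and the rule would only be applied to this host when the two conditions are joined with a logical OR.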