Bioinformatics Tools: March 2014

Saturday, March 8, 2014

Grep Tutorial

This is a reference booklet for grep and regular expressions. For a detailed explanation of each usage, please refer to a more elaborate guide. grep stands for Global Regular Expression Print. GNU grep supports four pattern types: basic regular expressions, extended regular expressions, fixed strings and Perl-style regular expressions. By default grep prints each line of text that contains the searched string, prefixed with the filename when more than one file is searched. Literals are the normal text characters, whereas metacharacters have special meanings. A portion enclosed in backticks (``) is run by the shell as a command and replaced by its output. Double quotes ("") let the shell expand environment variables inside the search pattern; single quotes ('') pass the pattern to grep unchanged.
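The quoting behaviour comes from the shell rather than from grep itself. The following is a minimal sketch, assuming a POSIX shell; the file name sample.txt and the variable TARGET are made up for illustration.

```shell
# Work in a scratch directory so no real file is touched.
tmpdir=$(mktemp -d)
printf 'gene BRCA1\nliteral $TARGET\ngene TP53\n' > "$tmpdir/sample.txt"

# TARGET is a hypothetical variable used only for this demonstration.
TARGET=BRCA1

# Double quotes: the shell expands $TARGET before grep sees the pattern.
expanded=$(grep "$TARGET" "$tmpdir/sample.txt")

# Single quotes: grep receives the text $TARGET unchanged;
# -F additionally treats every character, including $, literally.
literal=$(grep -F '$TARGET' "$tmpdir/sample.txt")

# Command substitution ($(...) or backticks): the output of the inner
# command becomes the pattern.
substituted=$(grep "$(printf 'TP53')" "$tmpdir/sample.txt")

printf '%s\n%s\n%s\n' "$expanded" "$literal" "$substituted"
rm -r "$tmpdir"
```

When in doubt, single quotes are the safer default for patterns, since they keep the shell from touching any character.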

There are two ways to search with grep: searching for a fixed string and searching for a pattern. Strings are concatenated simply by placing them next to each other inside the regular expression. Concatenation is processed before alternation.
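Because concatenation is processed before alternation, an unparenthesised alternation splits the whole pattern in two. A small sketch of the difference, assuming GNU grep and a made-up file words.txt:

```shell
tmpdir=$(mktemp -d)
# words.txt is a hypothetical input file.
printf 'abd\nacd\nab\ncd\n' > "$tmpdir/words.txt"

# Concatenation binds tighter than alternation: this means (ab) OR (cd).
loose=$(grep -E -x 'ab|cd' "$tmpdir/words.txt")

# Parentheses limit the alternation: a, then b or c, then d.
grouped=$(grep -E -x 'a(b|c)d' "$tmpdir/words.txt")

printf 'ab|cd matches:\n%s\na(b|c)d matches:\n%s\n' "$loose" "$grouped"
rm -r "$tmpdir"
```

The -x option is used here only so that partial matches inside longer lines do not blur the comparison.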

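grep -E lets ?, +, { }, | and ( ) be written without backslashes, so the same task usually takes fewer characters. Patterns that make significant use of grouping and backreferences are easier to read in this form; note that backreferences in extended expressions are a GNU extension and are not part of POSIX extended regular expressions.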
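To compare the two syntaxes, here is the same backreference pattern written for plain grep and for grep -E. It is a sketch that assumes GNU grep; seqs.txt is a hypothetical file.

```shell
tmpdir=$(mktemp -d)
# seqs.txt is a made-up input file with short sequences.
printf 'ATAT\nATGC\nGCGC\n' > "$tmpdir/seqs.txt"

# Basic syntax: the group needs backslashes.
# Pattern: any two characters, followed by the same two characters again.
basic=$(grep '^\(..\)\1$' "$tmpdir/seqs.txt")

# Extended syntax: same meaning, fewer characters.
extended=$(grep -E '^(..)\1$' "$tmpdir/seqs.txt")

printf 'basic:\n%s\nextended:\n%s\n' "$basic" "$extended"
rm -r "$tmpdir"
```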

grep -F treats the search pattern as a fixed string: metacharacters, escapes, wildcards and alternations are not interpreted and match only themselves.
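The effect of -F is easiest to see on text that contains metacharacters. A minimal sketch, with names.txt as a hypothetical file:

```shell
tmpdir=$(mktemp -d)
# names.txt is a made-up input file.
printf 'chr1.fa\nchr1xfa\na.*b\naXYZb\n' > "$tmpdir/names.txt"

# As a regular expression the dot matches any single character.
regex_hits=$(grep -c 'chr1.fa' "$tmpdir/names.txt")

# As a fixed string the dot matches only a dot.
fixed_hits=$(grep -c -F 'chr1.fa' "$tmpdir/names.txt")

# With -F the sequence .* is plain text, not "any run of characters".
star_line=$(grep -F 'a.*b' "$tmpdir/names.txt")

printf 'regex: %s  fixed: %s  literal .*: %s\n' "$regex_hits" "$fixed_hits" "$star_line"
rm -r "$tmpdir"
```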

The general syntax of grep is as follows: grep [options] pattern [filename ...]

Example: grep -n 'error' logfile.txt
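A runnable version of this example, as a sketch: the log contents and the second file other.txt are invented for the demonstration. It also shows how the output changes when more than one file is given.

```shell
tmpdir=$(mktemp -d)
# Both files are hypothetical sample inputs.
printf 'start\nerror: disk full\nok\n' > "$tmpdir/logfile.txt"
printf 'fatal error\n' > "$tmpdir/other.txt"

# One file: matching lines only, here with line numbers from -n.
single=$(cd "$tmpdir" && grep -n 'error' logfile.txt)

# Several files: each line is also prefixed with the file it came from.
multi=$(cd "$tmpdir" && grep -n 'error' logfile.txt other.txt)

printf '%s\n---\n%s\n' "$single" "$multi"
rm -r "$tmpdir"
```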

Metacharacter	Name	Matches
Single Character Match
.	Dot	Any one character
[…]	Character class	Any one of the characters listed in brackets
[^…]	Negated character class	Any one character not listed in brackets
\char	Escape character	Use the character after escape (\) literally (not interpreted).
Position Match
^	Caret	Start of a line
$	Dollar	End of a line
\<	Backslash (less-than)	Start of a word
\>	Backslash (greater-than)	End of a word
Quantifiers (with plain grep write \?, \+ and \{ \}; the unescaped forms need grep -E)
?	Question mark	Optional match (zero or one of the preceding expression)
*	Asterisk	Zero or more of the preceding expression
+	Plus	One or more of the preceding expression (repetitive match)
{N}	Exactly match	Match exactly N times
{N,}	Match at least	Match at least N times
{min,max}	Specified range	Match at least min and at most max times, e.g. {3,4}

\|	Alternation	Match either of the expressions given (written | with grep -E)
-	Dash	Range inside a character class, e.g. [a-z]
(…)	Parentheses	Used to limit scope of alternation (sub pattern); written \(…\) with plain grep
\1, \2, \3, …	Backreferences	Matches text previously matched within parentheses
\b	Word boundary	Matches the empty position at the edge of a word, e.g. between a letter and a space or period
\B	Non-word boundary	Matches a position that is not at the edge of a word (a literal backslash is matched with \\)
\w	Word character	Used for matching any word character, i.e. letter, number or underscore
\W	Non-word character	Used for matching anything considered not-word i.e. other than letter, number and underscore
\`	Start of buffer	Start of buffer sent to grep
\'	End of buffer	Matches the end of buffer sent to grep
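The position metacharacters in the table can be tried on a small word list. The sketch below assumes GNU grep, since \<, \> and \B are GNU extensions; animals.txt is a hypothetical file.

```shell
tmpdir=$(mktemp -d)
# animals.txt is a made-up input file.
printf 'cat\nconcatenate\ncat food\nthe cat\n' > "$tmpdir/animals.txt"

# ^ anchors at the start of a line: "cat" and "cat food".
line_start=$(grep -c '^cat' "$tmpdir/animals.txt")

# $ anchors at the end of a line: "cat" and "the cat".
line_end=$(grep -c 'cat$' "$tmpdir/animals.txt")

# \< and \> require word edges: every line except "concatenate".
whole_word=$(grep -c '\<cat\>' "$tmpdir/animals.txt")

# \B requires the absence of a word edge: only "concatenate".
inside_word=$(grep '\Bcat\B' "$tmpdir/animals.txt")

printf 'start=%s end=%s word=%s inside=%s\n' \
    "$line_start" "$line_end" "$whole_word" "$inside_word"
rm -r "$tmpdir"
```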

POSIX character classes (used inside a bracket expression, e.g. [[:digit:]])
[:alpha:]		Any alphabetical character
[:digit:]		Any numerical character
[:alnum:]		Any alphabetical or numerical character
[:blank:]		Space or tab character
[:xdigit:]		Hexadecimal character
[:punct:]		Any punctuation symbol
[:print:]		Any printable character (not control characters)
[:space:]		Any white space character
[:graph:]		Any printable character except whitespace
[:upper:]		Any uppercase letter
[:lower:]		Any lowercase letter
[:cntrl:]		Control character
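POSIX classes only work inside a bracket expression, so the brackets appear doubled in practice. A short sketch with a hypothetical file ids.txt:

```shell
tmpdir=$(mktemp -d)
# ids.txt is a made-up input file; the last line starts with a tab.
printf 'sample_01\nSample A\n42\n\tindented\n' > "$tmpdir/ids.txt"

# Lines that consist of digits only.
digits_only=$(grep -E -x '[[:digit:]]+' "$tmpdir/ids.txt")

# Lines that contain at least one uppercase letter.
has_upper=$(grep '[[:upper:]]' "$tmpdir/ids.txt")

# Number of lines that begin with a space or a tab.
starts_blank=$(grep -c '^[[:blank:]]' "$tmpdir/ids.txt")

printf 'digits: %s\nupper: %s\nblank-led lines: %s\n' \
    "$digits_only" "$has_upper" "$starts_blank"
rm -r "$tmpdir"
```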

Basic regular expression	grep or grep -G
-e	-e pattern	Uses pattern as the search pattern; needed when the pattern begins with a hyphen or when several patterns are given, e.g. grep -e -style (matches -style)
-f	-f file	Takes patterns from file. The pattern file must list one pattern per line.
-i	-i (ignore case)	Case insensitive search
-v	-v (invert match)	Returns lines that do not match pattern
-w	-w (word boundary match)	Matches exact word with boundary.
-x	-x (line match)	Selects only lines that match the pattern in their entirety, e.g. grep -x 'Hello, World!'
-c	-c (counts)	Counts the number of matching lines
-l	grep -l "error" *.txt	Prints the names of files containing the pattern; scanning of each file stops at its first match
-L	grep -L "error" *.txt	Prints the names of files that do not contain the pattern; scanning of each file stops at its first match
-m num	grep -m 10 "error" *.txt	Stops reading each file after num matching lines, e.g. at most 10 lines per file that match the regular expression
-o	grep -o pattern filename	Prints only the text that matches
-q	quiet	Suppresses output; only the exit status reports whether a match was found
-s	silent, no messages	Silently discards any error messages resulting from permission errors or non-existent files
-b	byte offset	Displays the byte offset within the file before each line of output
-H	with filename	Includes the name of the file before each line printed (default when more than one file is input)
-h	no filename	when more than one filename is given it suppresses printing the filename before each output
--label=LABEL	adds label	Shows input read from standard input as if it came from a file named LABEL
-n	line number	Includes the line number of each line displayed.
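-T	initial tab	Inserts a tab before the content of each line so that it lines up after the -H, -n and -b prefixes
-u	Unix byte offsets	Computes the byte offset as if it were running under a Unix system (only meaningful together with -b on MS-DOS and Windows)
-Z	null	Prints ASCII NUL (a zero byte) after each filename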
-A num	after context = num	Prints num (number of lines) after match
-B num	before context = num	Prints num (number of lines) before match
-C num, -num		Prints num lines before and after each match
-R or -r	recursive	Searches the files underneath a directory given as input, e.g. grep -R pattern path
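A few of the options above, tried on a small FASTA-style file. This is only a sketch; reads.fa and its contents are invented for the example.

```shell
tmpdir=$(mktemp -d)
# reads.fa is a hypothetical FASTA-style file with three records.
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$tmpdir/reads.fa"

# -c counts matching lines: here, the number of header lines.
headers=$(grep -c '^>' "$tmpdir/reads.fa")

# -v inverts the match: everything that is not a header.
bases=$(grep -v '^>' "$tmpdir/reads.fa")

# -o prints only the matching text, not the whole line.
ids=$(grep -o 'seq[0-9]*' "$tmpdir/reads.fa")

# -A 1 prints one line of context after each match.
record=$(grep -A 1 '^>seq2' "$tmpdir/reads.fa")

printf 'headers: %s\n%s\n%s\n%s\n' "$headers" "$bases" "$ids" "$record"
rm -r "$tmpdir"
```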
Extended Regular Expressions	egrep or grep -E
?		The item preceding ? may or may not appear in the target string.
+		One or more repetitions of the preceding item, e.g. grep -E 'regex1+' filename (will look for regex1, regex11, regex111 etc.)
{n,m}		The preceding item must be repeated at least n and at most m times, e.g. grep -E 'regex{4,6}' filename (matches rege followed by four to six x)
|		| is or; it combines several patterns into one expression, e.g. grep -E 'regex1|regex2' filename
( )		Used to group particular strings of text for various roles, e.g. backreferences, alternation, or simply readability
[{]		[ ] around a single metacharacter matches that character without invoking its special meaning
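The extended quantifiers can be checked against a handful of short strings. A minimal sketch, assuming grep -E is available; runs.txt is a hypothetical file.

```shell
tmpdir=$(mktemp -d)
# runs.txt is a made-up input file.
printf 'AT\nAAT\nAAAAT\nCT\nT\n' > "$tmpdir/runs.txt"

# ? makes the preceding A optional.
optional=$(grep -E -x 'A?T' "$tmpdir/runs.txt")

# + needs at least one A.
one_or_more=$(grep -E -x 'A+T' "$tmpdir/runs.txt")

# {2,4} needs between two and four A.
bounded=$(grep -E -x 'A{2,4}T' "$tmpdir/runs.txt")

# | chooses between two whole alternatives.
either=$(grep -E -x 'AT|CT' "$tmpdir/runs.txt")

# [{] matches a literal brace.
brace=$(printf 'a{b\nab\n' | grep -E 'a[{]b')

printf '%s\n--\n%s\n--\n%s\n--\n%s\n--\n%s\n' \
    "$optional" "$one_or_more" "$bounded" "$either" "$brace"
rm -r "$tmpdir"
```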
Fixed strings / Fast grep	fgrep or grep -F
-c	Count	Counts the number of lines that contain one or more instances of the pattern in a file, e.g. fgrep -c 'regex' filename
-e		Used for searching more than one pattern or when the pattern begins with hyphen
-f	Patterns from file	Reads the search strings from a file, one per line, instead of from the command line
-h		When pattern is searched on more than one file, -h prevents fgrep from displaying filenames before the matched output.
-i	ignores case (capitalization)	-i option ignores capitalization in the pattern when matching it.
-l		Displays the files containing the pattern but not the matching lines.
-n	number of the line	Prints out the line number before the line that matches the pattern.
-v	reverse match	Matches any lines that do not contain the given pattern
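Fixed-string search is handy for looking up a list of identifiers. The sketch below uses grep -F, which behaves like fgrep (recent GNU grep releases print a warning when fgrep is called directly); patterns.txt and genes.txt are hypothetical files.

```shell
tmpdir=$(mktemp -d)
# Both files are made-up inputs: one search string per line in patterns.txt.
printf 'BRCA1\nTP53\n' > "$tmpdir/patterns.txt"
printf 'BRCA1 chr17\nEGFR chr7\nTP53 chr17\ntp53 lower\n' > "$tmpdir/genes.txt"

# -f reads the fixed strings from a file.
hits=$(grep -F -f "$tmpdir/patterns.txt" "$tmpdir/genes.txt")

# -i ignores case, -c counts the matching lines.
hits_nocase=$(grep -F -i -c -f "$tmpdir/patterns.txt" "$tmpdir/genes.txt")

# -e protects a search string that begins with a hyphen.
hyphen=$(printf 'font-style\nfontstyle\n' | grep -F -e '-style')

printf '%s\ncase-insensitive count: %s\n%s\n' "$hits" "$hits_nocase" "$hyphen"
rm -r "$tmpdir"
```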
Perl Style grep	grep -P	Perl-Compatible Regular Expression (PCRE)
PCRE-specific escapes
\a		Matches the alarm character
\cX		Matches ctrl+X, where X is any letter
\e		Matches escape character
\f		Matches form feed character
\n		Matches newline character
\r		Matches carriage return
\t		Matches tab character
\d		Matches any decimal digit
\D		Matches any character that is not a decimal digit
\s		Matches any whitespace character
\S		Matches any non-whitespace character
\w		Matches any word character
\W		Matches any non-word character
\b		Matches when at word boundary
\B		Matches when not at a word boundary
\A		Matches when at start of subject
\Z		Matches when at end of subject or before a final newline
\z		Matches when at end of subject
\G		Matches at first matching position
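A few of the PCRE escapes in use. This is a sketch that assumes a GNU grep built with PCRE support; since not every grep has -P, the example checks for it first and skips the demonstration otherwise. The input lines are invented.

```shell
# Probe for -P support before relying on it.
if printf '1\n' | grep -q -P '\d' 2>/dev/null; then
    pcre=yes

    # \d+ matches a run of decimal digits; -o prints just that run.
    number=$(printf 'id 42\nno id\n' | grep -o -P '\d+')

    # \t matches a tab character: only the first line contains one.
    tab_lines=$(printf 'a\tb\na b\n' | grep -c -P '\t')

    # \S+ matches a run of non-whitespace; ^ keeps it to the first field.
    first_field=$(printf 'chr1 100 200\n' | grep -o -P '^\S+')

    printf 'number=%s tab_lines=%s first_field=%s\n' \
        "$number" "$tab_lines" "$first_field"
else
    pcre=no
    echo 'grep -P is not available in this grep build'
fi
```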

SPACE FOR NOTES:
