www.cs.rochester.edu Open in urlscan Pro
128.151.167.12  Public Scan

Submitted URL: http://www.cs.rochester.edu//u//nelson//courses//csc_173//fa//re.html
Effective URL: https://www.cs.rochester.edu//u//nelson//courses//csc_173//fa//re.html
Submission: On July 31 via api from US — Scanned from US

Form analysis 0 forms found in the DOM

Text Content

REGULAR EXPRESSIONS

Just as finite automata are used to recognize patterns of strings, regular
expressions are used to generate patterns of strings. A regular expression is an
algebraic formula whose value is a pattern consisting of a set of strings,
called the language of the expression.

Operands in a regular expression can be:

 * characters from the alphabet over which the regular expression is defined.
 * variables whose values are any pattern defined by a regular expression.
 * epsilon which denotes the empty string containing no characters.
 * null which denotes the empty set of strings.

Operators used in regular expressions include:

 * Union: If R1 and R2 are regular expressions, then R1 | R2 (also written as R1
   U R2 or R1 + R2) is also a regular expression.
   
   L(R1|R2) = L(R1) U L(R2).
   
   

 * Concatenation: If R1 and R2 are regular expressions, then R1R2 (also written
   as R1.R2) is also a regular expression.
   
   L(R1R2) = L(R1) concatenated with L(R2).
   
   

 * Kleene closure: If R1 is a regular expression, then R1* (the Kleene closure
   of R1) is also a regular expression.
   
   L(R1*) = epsilon U L(R1) U L(R1R1) U L(R1R1R1) U ...

Closure has the highest precedence, followed by concatenation, followed by
union.

--------------------------------------------------------------------------------


EXAMPLES

The set of strings over {0,1} that end in 3 consecutive 1's.

      (0 | 1)* 111
  

The set of strings over {0,1} that have at least one 1.

      0* 1 (0 | 1)*
  

The set of strings over {0,1} that have at most one 1.

      0* | 0* 1 0*
  

The set of strings over {A..Z,a..z} that contain the word "main".

      Let <letter> = A | B | ... | Z | a | b | ... | z
  
      <letter>* main <letter>*
  

The set of strings over {A..Z,a..z} that contain 3 x's.

      <letter>* x <letter>* x <letter>* x <letter>*
  

The set of identifiers in Pascal.

      Let <letter> = A | B | ... | Z | a | b | ... | z
      Let <digit> = 0 | 1 | 2 | 3 ... | 9
  
      <letter> (<letter> | <digit>)*
  

The set of real numbers in Pascal.

      Let <digit> = 0 | 1 | 2 | 3 ... | 9
      Let <exp> = 'E' <sign> <digit> <digit>* | epsilon
      Let <sign> = '+' | '-' | epsilon
      Let <decimal> = '.' <digit> <digit>* | epsilon
  
      <digit> <digit>* <decimal> <exp>
  

--------------------------------------------------------------------------------


UNIX OPERATOR EXTENSIONS

Regular expressions are used frequently in Unix:

 * In the command line
 * Within text editors
 * In the context of pattern matching programs such as grep and egrep

To facilitate construction of regular expressions, Unix recognizes additional
operators. These operators can be defined in terms of the operators given above;
they represent a notational convenience only.

 * character classes: '[' <list of chars> ']'
 * start of a line: '^'
 * end of a line: '$'
 * wildcard matching any character except newline: '.'
 * optional instance: R? = epsilon | R
 * one or more instances: R+ == RR*

--------------------------------------------------------------------------------


EQUIVALENCE OF REGULAR EXPRESSIONS AND FINITE AUTOMATA

Regular expressions and finite automata have equivalent expressive power:

 * For every regular expression R, there is a corresponding FA that accepts the
   set of strings generated by R.
   
   

 * For every FA A there is a corresponding regular expression that generates the
   set of strings accepted by A.

The proof is in two parts:

 1. an algorithm that, given a regular expression R, produces an FA A such that
    L(A) == L(R).
    
    

 2. an algorithm that, given an FA A, produces a regular expression R such that
    L(R) == L(A).

Our construction of FA from regular expressions will allow "epsilon transitions"
(a transition from one state to another with epsilon as the label). Such a
transition is always possible, since epsilon (or the empty string) can be said
to exist between any two input symbols. We can show that such epsilon
transitions are a notational convenience; for every FA with epsilon transitions
there is a corresponding FA without them.

--------------------------------------------------------------------------------


CONSTRUCTING AN FA FROM AN RE

We begin by showing how to construct an FA for the operands in a regular
expression.

 * If the operand is a character c, then our FA has two states, s0 (the start
   state) and sF (the final, accepting state), and a transition from s0 to sF
   with label c.
   
   

 * If the operand is epsilon, then our FA has two states, s0 (the start state)
   and sF (the final, accepting state), and an epsilon transition from s0 to sF.
   
   

 * If the operand is null, then our FA has two states, s0 (the start state) and
   sF (the final, accepting state), and no transitions.

Given FA for R1 and R2, we now show how to build an FA for R1R2, R1|R2, and R1*.
Let A (with start state a0 and final state aF) be the machine accepting L(R1)
and B (with start state b0 and final state bF) be the machine accepting L(R2).



 * The machine C accepting L(R1R2) includes A and B, with start state a0, final
   state bF, and an epsilon transition from aF to b0.
   
   

 * The machine C accepting L(R1|R2) includes A and B, with a new start state c0,
   a new final state cF, and epsilon transitions from c0 to a0 and b0, and from
   aF and bF to cF.
   
   

 * The machine C accepting L(R1*) includes A, with a new start state c0, a new
   final state cF, and epsilon transitions from c0 to a0 and cF, and from aF to
   a0, and from aF to cF.

--------------------------------------------------------------------------------


ELIMINATING EPSILON TRANSITIONS

If we can eliminate epsilon transitions from an FA, then our construction of an
FA from a regular expression (which yields an FA with epsilon transitions) can
be completed.

Observe that epsilon transitions are similar to nondeterminism in that they
offer a choice: an epsilon transition allows us to stay in a state or move to a
new state, regardless of the input symbol.

If starting in state s1, we can reach state s2 via a series of epsilon
transitions followed by a transition on input symbol x, we can replace all of
the epsilon transitions with a single transition from s1 to s2 on symbol x.

--------------------------------------------------------------------------------


ALGORITHM FOR ELIMINATING EPSILON TRANSITIONS

We can build a finite automaton F2 with no epsilon transitions from a finite
automaton F1 containing epsilon transitions as follows:

    

 1. The states of F2 are all the states of F1 that have an entering transition
    labeled by some symbol other than epsilon, plus the start state of F1, which
    is also the start state of F2.
    
    

 2. For each state in F1, determine which other states are reachable via epsilon
    transitions only. If a state of F1 can reach a final state in F1 via epsilon
    transitions, then the corresponding state is a final state in F2.
    
    

 3. For each pair of states i and j in F2, there is a transition from state i to
    state j on input x if there exists a state k that is reachable from state i
    via epsilon transitions in F1, and there is a transition in F1 from state k
    to state j on input x.

--------------------------------------------------------------------------------


CONSTRUCTING AN RE FROM AN FA

To construct a regular expression from a DFA (and thereby complete the proof
that regular expressions and finite automata have the same expressive power), we
replace each state in the DFA one by one with a corresponding regular
expression.

Just as we built a small FA for each operator and operand in a regular
expression, we will now build a small regular expression for each state in the
DFA.

The basic idea is to eliminate the states of the FA one by one, replacing each
state with a regular expression that generates the portion of the input string
that labels the transitions into and out of the state being eliminated.

--------------------------------------------------------------------------------


ALGORITHM FOR CONSTRUCTING AN RE FROM AN FA

Given a DFA F we construct a regular expression R such that
L(F) == L(R).

We preprocess the FA, turning the labels on transitions into regular
expressions. If there is a transition with label {a,b}, then we replace the
label with the regular expression a | b. If there is no transition from a state
to itself, we can add one with the label NULL.

For each accepting state sF in F, eliminate all states in F except the start
state s0 and sF.

To eliminate a state sE, consider all pairs of states sA and sB such that there
is a transition from sA to sE with label R1, a transition from sE to sE with
label R2 (possibly null, meaning no transition), and a transition from sE to sB
with label R3. Introduce a transition from sA to sB with label R1R2*R3. If there
is already a transition from sA to sB with label R4, then replace that label
with R4|R1R2*R3.

After eliminating all states except s0 and sF:

 * If s0 == sF, then the resulting regular expression is R1*, where R is the
   label on the transition from s0 to s0.
   
   

 * If s0 != sF, then assume the transition from s0 to s0 is labeled R1, the
   transition from s0 to sF is labeled R2, the transition from sF to sF is
   labeled R3, and the transition from sF to s0 is labeled R4. The resulting
   regular expression is R1*R2(R3 | R4R1*R2)*

Let RFi be the regular expression produced by eliminating all the states except
s0 and sFi. If there are n final states in the DFA, then the regular expression
that generates the strings accepted by the original DFA is RF1 | RF2 | ... RFn.

--------------------------------------------------------------------------------


SUMMARY OF RESULTS

We have shown that all four of the following formalisms for expressing languages
of strings are equivalent:

 * deterministic finite automata
   
   

 * nondeterministic finite automata
   
   

 * nondeterministic finite automata with epsilon transitions
   
   

 * regular expressions