Monday, September 28, 2009

python - split paragraph into sentences with regular expressions

# split up a paragraph into sentences
# using regular expressions

def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split on multiple characters,
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

#   This is a sentence
#   This is an excited sentence
#   And do you think this is a question
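
One thing the output above shows is that splitting on '[.!?]' throws away the punctuation, and because the paragraph ends with a '?', the returned list also ends with an empty string. If you want to keep the punctuation, here is a minimal sketch of one way to do it. The function name split_keep_punctuation is made up for this example and is not part of the original script; it splits on the whitespace that follows a sentence ender instead of on the ender itself.

```python
import re

def split_keep_punctuation(paragraph):
    '''Hypothetical variant: split on whitespace that follows . ! or ?
    so each sentence keeps its ending punctuation.'''
    # (?<=[.!?]) is a lookbehind: it requires a sentence ender just before
    # the whitespace, but does not make it part of the separator.
    parts = re.split(r'(?<=[.!?])\s+', paragraph)
    # Drop empty strings, e.g. when the paragraph ends with trailing spaces.
    return [part for part in parts if part]

p = "This is a sentence.  This is an excited sentence! And do you think this is a question?"
for s in split_keep_punctuation(p):
    print(s)

# This is a sentence.
# This is an excited sentence!
# And do you think this is a question?
```

Like the original, this sketch knows nothing about abbreviations, so "e.g. " in the middle of a sentence would still cause a split.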


  1. "You may have e.g. an abbreviation within the sentence. Don't you think?!"

  2. @adam
    Thanks for the scrutiny! You are right, abbreviations are fairly common. If I were going to try to handle that scenario, I would change my regular expression to read: '[.!?][\s]{1,2}[A-Z]'
    That way I look for a sentence-ending mark, then one or two whitespace characters, then a capital letter starting another sentence.
    Of course you could still have a sentence containing Dr. Pepper, and it would split between Dr. and Pepper. But you could always account for that by enforcing 2 spaces between sentences, changing {1,2} to {2}.

  3. Hey Steve, this is cool, but I can't understand why the first character of the next sentence is getting eliminated.
    For example, if we run:
    Sheffield is beautiful city. It was named after river Sheaf.
    The output we get is:
    Sheffield is beautiful city.
    t was named after river Sheaf.

  4. @Samar
    Hi Samar,
    You are right, I get the same results. The problem with the regular expression is that I left in the [A-Z] section, which makes one letter part of the separator. Changing the expression to '[.!?][\s]{1,2}' will stop the script from eliminating the first letter of the next sentence.

  5. Thanks a lot for the reply, but when I tried that it went back to the earlier behaviour, i.e. abbreviations were no longer handled.

  6. Got the solution. It happens because the first letter is captured as part of the separator, and hence it disappears. Modifying the regular expression to [.!?][\s]{1,2}(?=[A-Z]) solves the problem. :)

  7. Thank you very much. This code is very helpful.
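
To make the thread above easier to follow, here is a small sketch that runs the three patterns from the comments side by side on the two example sentences. The variable names (consuming, no_capital, lookahead) are made up for this comparison; only the patterns themselves come from the comments.

```python
import re

# The three patterns discussed in the comments.
consuming = re.compile(r'[.!?][\s]{1,2}[A-Z]')       # capital letter is part of the separator
no_capital = re.compile(r'[.!?][\s]{1,2}')           # no check for a capital letter at all
lookahead = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])')   # capital letter is checked but not consumed

text = "Sheffield is beautiful city. It was named after river Sheaf."
abbrev = "You may have e.g. an abbreviation within the sentence. Don't you think?!"

print(consuming.split(text))
# ['Sheffield is beautiful city', 't was named after river Sheaf.']
print(lookahead.split(text))
# ['Sheffield is beautiful city', 'It was named after river Sheaf.']

print(no_capital.split(abbrev))
# ['You may have e.g', 'an abbreviation within the sentence', "Don't you think?!"]
print(lookahead.split(abbrev))
# ['You may have e.g. an abbreviation within the sentence', "Don't you think?!"]
```

The lookahead (?=[A-Z]) only peeks at the next character, so the capital letter stays with the following sentence. It is still just a heuristic: an abbreviation followed by a capitalized word, such as Dr. Pepper, will still be split.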