Pythonic Prose: python - split paragraph into sentences with regular expressions

Monday, September 28, 2009

python - split paragraph into sentences with regular expressions

# split up a paragraph into sentences
# using regular expressions


import re


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    # to split on multiple characters,
    #   regular expressions are easiest (and fastest)
    sentenceEnders = re.compile(r'[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = """This is a sentence.  This is an excited sentence! And do you think this is a question?"""

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())

#output:
#   This is a sentence
#   This is an excited sentence
#   And do you think this is a question
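
One detail worth knowing about the split above: when the paragraph ends with a sentence ender, the returned list ends with an empty string. The following is a minimal sketch, assuming you want to drop such empty pieces; the helper name splitAndClean is hypothetical and not part of the original script.

```python
import re


def splitAndClean(paragraph):
    """Split on . ! ? and drop empty or whitespace-only pieces."""
    pieces = re.split(r'[.!?]', paragraph)
    return [piece.strip() for piece in pieces if piece.strip()]


p = "This is a sentence.  This is an excited sentence! And do you think this is a question?"

# The raw split ends with an empty string because the text ends with '?'.
raw = re.split(r'[.!?]', p)
print(raw[-1] == '')  # True

sentences = splitAndClean(p)
print(sentences)
# ['This is a sentence', 'This is an excited sentence', 'And do you think this is a question']
```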

7 comments:

adam, November 1, 2010 at 1:23 PM
"You may have e.g. an abbreviation within the sentence. Don't you think?!"
steve, November 1, 2010 at 3:23 PM
@adam
Thanks for the scrutiny! You are right, abbreviations are fairly common. If I were going to try and handle that scenario I would change my regular expression to read: '[.!?][\s]{1,2}[A-Z]'
That way I look for a sentence ender, then one or two whitespace characters, and then a capital letter starting the next sentence.
Of course, you could still have a sentence containing "Dr. Pepper", and it would split between "Dr." and "Pepper". But you could account for that by enforcing two spaces between sentences, changing {1,2} to {2}.
Enjoy!
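
To see where the two variants would split, here is a small sketch that lists the separators each pattern matches. The sample sentence and the variable names are made up for illustration; only the two patterns come from the comment above.

```python
import re

text = "Dr. Pepper is tasty.  I like it."

# One or two whitespace characters between sentences.
loose = re.compile(r'[.!?][\s]{1,2}[A-Z]')
# Exactly two whitespace characters between sentences.
strict = re.compile(r'[.!?][\s]{2}[A-Z]')

# Collect the text matched as a separator by each pattern.
loose_hits = [m.group() for m in loose.finditer(text)]
strict_hits = [m.group() for m in strict.finditer(text)]

print(loose_hits)   # ['. P', '.  I']
print(strict_hits)  # ['.  I']
```

With {1,2} the text is cut after "Dr" as well as after "tasty"; with {2} only the double-spaced boundary is matched.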
Sirius, August 7, 2011 at 5:01 AM
Hey Steve, this is cool, but I can't understand why the first character of the next sentence is getting eliminated.
For example, if we run:
Sheffield is beautiful city. It was named after river Sheaf.
The output we get is:
Sheffield is beautiful city.
t was named after river Sheaf.
steve, August 8, 2011 at 12:42 PM
@Samar
Hi Samar,
You are right, I get the same results. The problem with the regular expression is that I left in the [A-Z] section, which makes the capital letter part of the match, so it is consumed as part of the separator. Changing the expression to '[.!?][\s]{1,2}' will stop the script from eliminating the first letter of the next sentence.
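
A short sketch of the difference, using the sentence from the comment above (the variable names are only illustrative):

```python
import re

text = "Sheffield is beautiful city. It was named after river Sheaf."

with_capital = re.compile(r'[.!?][\s]{1,2}[A-Z]')
without_capital = re.compile(r'[.!?][\s]{1,2}')

# The first match includes the capital letter, so split() throws it away.
first_match = with_capital.search(text).group()
print(repr(first_match))  # '. I'

lossy = with_capital.split(text)
fixed = without_capital.split(text)

print(lossy)  # ['Sheffield is beautiful city', 't was named after river Sheaf.']
print(fixed)  # ['Sheffield is beautiful city', 'It was named after river Sheaf.']
```

Note that the shorter pattern no longer checks for a capital letter, so it will also split after an abbreviation that is followed by a lowercase word.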
Sirius, August 10, 2011 at 12:06 PM
Thanks a lot for the reply, but when I tried that it went back to the earlier behavior, i.e. abbreviations were no longer handled.
Sirius, August 12, 2011 at 1:56 PM
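Got the solution. It happens because the first letter of the next sentence is also consumed by the match, and hence it disappears. Modifying the regular expression to [.!?][\s]{1,2}(?=[A-Z]) solves the problem, because the lookahead checks for the capital letter without consuming it. :)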
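
Here is a minimal sketch of the lookahead version applied to the sentences discussed in this thread. The function name splitWithLookahead and the "Dr. Pepper" sample sentence are made up for illustration.

```python
import re

# (?=[A-Z]) requires a capital letter after the whitespace but does not consume it.
sentenceEnders = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])')


def splitWithLookahead(paragraph):
    """Split at a sentence ender followed by 1-2 whitespace characters and a capital."""
    return sentenceEnders.split(paragraph)


sheffield = splitWithLookahead("Sheffield is beautiful city. It was named after river Sheaf.")
abbreviation = splitWithLookahead("You may have e.g. an abbreviation within the sentence. Don't you think?!")
title = splitWithLookahead("I like Dr. Pepper. It is sweet.")

print(sheffield)     # ['Sheffield is beautiful city', 'It was named after river Sheaf.']
print(abbreviation)  # ['You may have e.g. an abbreviation within the sentence', "Don't you think?!"]
print(title)         # ['I like Dr', 'Pepper', 'It is sweet.']
```

The first letter is kept and "e.g. an" is no longer split, but an abbreviation followed by a capitalized word, such as "Dr. Pepper", still is.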
Unknown, June 24, 2015 at 1:05 AM
Thank you very much... this code is very helpful.