# split up a paragraph into sentences
# using regular expressions

def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multiple characters
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    p = """This is a sentence. This is an excited sentence! And do you think this is a question?"""
    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        # print(...) with a single argument works in both Python 2 and Python 3
        print(s.strip())

#output:
#   This is a sentence
#   This is an excited sentence
#   And do you think this is a question
A Python example-based blog that shows how to accomplish Python goals and how to correct Python errors.
Monday, September 28, 2009
python - split paragraph into sentences with regular expressions
"You may have e.g. an abbreviation within the sentence. Don't you think?!"
@adam
Thanks for the scrutiny! You are right, abbreviations are fairly common. If I were going to try and handle that scenario I would change my regular expression to read: '[.!?][\s]{1,2}[A-Z]'
That way I look for sentence-ending punctuation, then one or two spaces, and then a capital letter starting another sentence.
Of course you could still have a sentence like Dr. Pepper and it would split between Dr. and Pepper. But you could always account for that by enforcing 2 spaces between sentences by changing {1,2} to {2}.
Enjoy!
Hey Steve, this is cool, but I can't understand why the first character of the next sentence is getting eliminated.
For example, if we run:
Sheffield is beautiful city. It was named after river Sheaf.
The output we get is
Sheffield is beautiful city.
t was named after river Sheaf.
@Samar
Hi Samar,
You are right, I get the same results. The problem with the regular expression is that I left in the [A-Z] section, so the capital letter is matched as part of the separator and thrown away by the split. Changing the expression to '[.!?][\s]{1,2}' will stop the script from eliminating the first letter of the next sentence.
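A small sketch of why the letter vanishes, assuming Python's standard `re` module: `split` discards everything the pattern matches, so a pattern ending in `[A-Z]` swallows the capital letter too. The variable names here are made up for illustration.

```python
import re

text = "Sheffield is beautiful city. It was named after river Sheaf."

# Pattern from the earlier reply: the trailing [A-Z] is part of the match,
# so split() discards the capital letter along with the punctuation.
with_capital = re.compile(r'[.!?][\s]{1,2}[A-Z]')
print(with_capital.split(text))

# Pattern without [A-Z]: only the punctuation and whitespace are discarded.
without_capital = re.compile(r'[.!?][\s]{1,2}')
print(without_capital.split(text))
```

The first call loses the "I" of "It"; the second keeps it, but it will also split after any abbreviation that is followed by a space.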
Thanks a lot for the reply, but when I tried that it was back to the original behavior, i.e. abbreviations were no longer handled.
Got the solution. It happens because the first letter is captured as part of the separator and hence disappears. Modifying the regular expression to [.!?][\s]{1,2}(?=[A-Z]) solves the problem. :)
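Here is a minimal sketch of the lookahead version, assuming Python's standard `re` module. `(?=[A-Z])` requires a capital letter to follow but does not make it part of the match, so the split leaves it in place. The helper name `split_sentences` is hypothetical and not from the original post.

```python
import re

# (?=[A-Z]) is a lookahead: a capital letter must come next,
# but it is not consumed by the match.
sentence_enders = re.compile(r'[.!?][\s]{1,2}(?=[A-Z])')

def split_sentences(paragraph):
    """Hypothetical helper: split where punctuation and 1-2 whitespace
    characters are followed by a capital letter."""
    return sentence_enders.split(paragraph)

first = split_sentences("Sheffield is beautiful city. It was named after river Sheaf.")
second = split_sentences("You may have e.g. an abbreviation within the sentence. Don't you think?!")
third = split_sentences("I like Dr. Pepper. Do you?")

for result in (first, second, third):
    print(result)
```

The first letter now survives and "e.g. an" is left alone, since a lowercase letter follows it. The punctuation at each split point is still dropped, and "Dr. Pepper" is still split, as noted earlier in the thread.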
Thank you very much... this code is very helpful.