Tuesday, 12 July 2016

Using regex to deal with escape characters in URLs


I'm in the process of tokenizing strings which contain URLs. Here is the part I use to pick up the URLs:

regex_str = [r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-f][0-9a-f]))+']

It picks up "regular" URLs perfectly fine; however, some of the URLs look like this:

https://t.co/c1taPXzi4X

How can I modify the regex so that it deals with the escape characters, in order to end up with a complete and clean URL?

Many thanks in advance! :)
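
One possible approach, offered as a minimal sketch rather than a definitive answer: it assumes the escape characters are backslashes placed before the slashes (JSON-style `\/`), which is a guess, since the example URL above shows plain slashes. The pattern keeps the original character classes, lets each slash after the scheme be optionally preceded by a backslash, and then strips the backslashes from the match. The names `URL_RE`, `extract_urls`, and `sample` are hypothetical, and the `re.IGNORECASE` flag is also an assumption about how the pattern is compiled.

```python
import re

# Assumption: the "escape characters" are backslashes in front of slashes
# (JSON-style "\/"). The sample text below is made up for illustration.
URL_RE = re.compile(
    # (?:\\?/){2} matches "//" as well as "\/\/" after the scheme.
    # The original class [$-_@.&+] already covers "\" and "/" in the body,
    # because both fall inside the "$" to "_" character range.
    r'http[s]?:(?:\\?/){2}'
    r'(?:[a-z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-f][0-9a-f]))+',
    re.IGNORECASE,
)


def extract_urls(text):
    """Return the URLs found in text, with escaping backslashes removed."""
    return [m.group(0).replace('\\/', '/') for m in URL_RE.finditer(text)]


sample = r'check this https:\/\/t.co\/c1taPXzi4X out'
print(extract_urls(sample))
```

Cleaning the match after extraction keeps the pattern close to the original; if the strings come from JSON, decoding them with a JSON parser before tokenizing would likely remove the backslashes without any change to the regex.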

