I'm in the process of tokenizing strings which contain URLs. Here is the part I use to pick up the URLs:
regex_str = [r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-f][0-9a-f]))+']
It picks up "regular" URLs perfectly fine; however some of the URLs look like this:
https://t.co/c1taPXzi4X
How can I modify the regex so that it deals with the escape characters, in order to end up with a complete and clean URL?
Many thanks in advance! :)
Aucun commentaire:
Enregistrer un commentaire