Message 94056 - Python tracker



Author	verdy_p
Recipients	ezio.melotti, r.david.murray, verdy_p
Date	2009-10-14.23:52:10
SpamBayes Score	9.536261e-13
Marked as misclassified	No
Message-id	<[email protected]>
In-reply-to

Content

> That's why I wrote 'without checking if they are in range(256)'; the
fact that this regex matches invalid digits was not relevant in my
example (and it's usually easier to convert the digits to int and check
if 0 <= digits <= 255). :)

NO! You also have to check the number of digits for values below 100 (2 
digits only) or below 10 (1 digit only): converting to int hides leading 
zeros, so "001" would pass the range check.
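
A minimal sketch of the difference, not the regexp from the earlier 
messages: the names OCTET, STRICT_IPV4, loose_is_ipv4 and strict_is_ipv4 
are hypothetical, and the sketch assumes that "number of digits" means 
rejecting leading zeros.

```python
import re

# Hypothetical strict octet: 0-255 with no leading zeros.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
STRICT_IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")


def loose_is_ipv4(text):
    """Range check only: convert each part to int and test 0 <= n <= 255."""
    parts = text.split(".")
    return len(parts) == 4 and all(
        p.isascii() and p.isdigit() and 0 <= int(p) <= 255 for p in parts
    )


def strict_is_ipv4(text):
    """Range check plus digit-count check, done by the pattern itself."""
    return STRICT_IPV4.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("192.168.1.1", "192.168.001.1", "256.1.1.1"):
        print(candidate, loose_is_ipv4(candidate), strict_is_ipv4(candidate))
```

The int-based check accepts "192.168.001.1" because int("001") is 1, 
while the strict pattern rejects it.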

When processing web log files, for example, or when parsing wiki 
pages or emails in which you want to autodetect ONLY valid IP 
addresses within some contexts and transform them to another form 
(for example, to convert them to links, or to differentiate 
'anonymous' users in wiki pages from registered named users), you 
need to match these IP addresses correctly. In addition, these files 
will often contain many other occurrences that you don't want to 
transform; you want only those in the specific contexts given by the 
regexp. For this reason, your suggestion will often not work as 
expected.

The real need is to match things exactly, within their context, and 
to capture all occurrences of capturing groups.
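
To illustrate the limitation in question, here is a small sketch using 
the standard re module. The pattern REPEATED and the helper names are 
hypothetical; the pattern is deliberately simplified and does no range 
checking. A repeated capturing group keeps only its last occurrence, so 
the earlier captures have to be recovered with a second pass.

```python
import re

# Hypothetical simplified pattern: group 1 is repeated three times.
REPEATED = re.compile(r"(?:(\d{1,3})\.){3}(\d{1,3})")


def last_captures(text):
    """Return what the match object exposes: only the last capture per group."""
    m = REPEATED.fullmatch(text)
    return m.groups() if m else None


def all_octets(text):
    """Workaround: match first, then split the matched text in a second pass."""
    m = REPEATED.fullmatch(text)
    return tuple(m.group(0).split(".")) if m else None


if __name__ == "__main__":
    print(last_captures("192.168.0.1"))  # ('0', '1')
    print(all_octets("192.168.0.1"))     # ('192', '168', '0', '1')
```

The first two octets are matched but not retrievable from the match 
object, which is why the extra split() step is needed here.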

I gave the IPv4 regexp only as a simple example to show the need, but 
there are of course much more complex cases, and it is exactly for 
those cases that I would like the extension: alternative code using 
partial matches and extra split() operations becomes tricky and is 
most often buggy. Only the original regexp is precise enough to parse 
the content correctly, find only the matches we want, and capture all 
the groups that we really want, in a single operation, at near-zero 
cost (and without complicating the rest of the code that uses it).

History
Date	User	Action	Args
2009-10-14 23:52:12	verdy_p	set	recipients: + verdy_p, ezio.melotti, r.david.murray
2009-10-14 23:52:12	verdy_p	set	messageid: <[email protected]>
2009-10-14 23:52:10	verdy_p	link	issue7132 messages
2009-10-14 23:52:10	verdy_p	create