- RU.PERL ----------------------------------------------------------------------
 From : Valentin Ermolaev                   2:463/544.12    24 Jun 2002  00:40:07
 To : Ivan V. Klepikov
 Subject : strings
--------------------------------------------------------------------------------

 IVK> Here's the task. I have a small HTML file. How do I get all the tags
 IVK> stripped out of it? I can't quite figure out how these regexps are
 IVK> put together... help!

== [perldoc -q 'remove html'] ==
=head1 Found in /usr/libdata/perl/5.00503/pod/perlfaq9.pod

=head2 How do I remove HTML from a string?

The most correct way (albeit not the fastest) is to use HTML::Parse
from CPAN (part of the HTML-Tree package on CPAN).

Many folks attempt a simple-minded regular expression approach, like
C<s/E<lt>.*?E<gt>//g>, but that fails in many cases because the tags
may continue over line breaks, they may contain quoted angle-brackets,
or HTML comments may be present. Plus folks forget to convert
entities, like C<&lt;> for example.

Here's one "simple-minded" approach, that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml
program in
http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz
.

Here are some tricky cases that you should think about when picking
a solution:

    <IMG SRC = "foo.gif" ALT = "A > B">

    <IMG SRC = "foo.gif"
         ALT = "A > B">

    <!-- <A comment> -->

    <script>if (a<b && a>c)</script>

    <# Just data #>

    <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break
on text like this:

    <!-- This section commented out.
        <B>You can't see me!</B>
    -->
== [perldoc -q 'remove html'] ==

--- [VE2-UANIC]
 * Origin: (2:463/544.12)
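
To see what the FAQ one-liner does and where it gives up, here is a rough sketch that wraps the same substitution in a sub and runs it over a few of the tricky cases quoted above. The name strip_tags and the sample strings are made up for illustration; only the regex itself comes from the FAQ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_tags is a hypothetical helper name; the substitution is the FAQ's regex.
sub strip_tags {
    my ($html) = @_;
    # A "tag" here is '<', then any mix of runs without quotes or '>'
    # and complete quoted strings, then the closing '>'.
    # /s lets '.' match newlines, so a quoted value may span lines.
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

my @cases = (
    '<b>bold</b> text',                                     # plain tags
    '<IMG SRC = "foo.gif" ALT = "A > B">after',             # quoted '>'
    "<IMG SRC = \"foo.gif\"\n     ALT = \"A > B\">after",   # tag over two lines
    '<!-- <A comment> -->',                                 # comment holding a tag
);

# Print each result in brackets so leftover whitespace is visible.
printf "[%s]\n", strip_tags($_) for @cases;
```

The first three cases come out clean ("bold text", "after", "after"). The comment case does not: the match stops at the first unquoted '>', so " -->" is left behind, which is exactly the kind of input the FAQ warns about. Entities such as &lt; are not decoded here either.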