Python: built-in module parser

parser

c:\users\gige\pycharmprojects\oisisi_python\parse\parser.py

# -*- coding: utf-8 -*-

Modules

os
re

Classes



html.parser.HTMLParser(_markupbase.ParserBase)

Parser

class Parser(html.parser.HTMLParser)

    Parser(*, convert_charrefs=True)

    Parser for HTML documents.

    Usage:
        parser = Parser()
        parser.parse(FILE_PATH)

Method resolution order:

Parser

html.parser.HTMLParser

_markupbase.ParserBase

builtins.object

Methods defined here:

error(self, message)

handle_data(self, data)
Records the words that are found. The method is called implicitly whenever the parser encounters the content of an HTML element. The content of the element is split into words, which are recorded in the corresponding list.

Argument:
- `data`: the element content that was received
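The source of Parser is not shown on this page, so the following is only a minimal sketch of how a handle_data override of this kind might look. The class name `WordCollector`, the attribute name `words`, and the choice of `re.findall(r"\w+", ...)` as the word splitter are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class WordCollector(HTMLParser):
    """Hypothetical stand-in for Parser; `words` is an assumed attribute name."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        # Assumption: a "word" is any run of word characters.
        # The real Parser may split the text differently.
        self.words.extend(re.findall(r"\w+", data))


collector = WordCollector()
collector.feed("<p>Hello, world!</p><b>again</b>")
collector.close()
print(collector.words)
```

Because handle_data may be called several times for one document, the sketch appends to the list instead of replacing it.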

handle_starttag(self, tag, attrs)
Records the value of href attributes. The method is called implicitly whenever the parser encounters a tag in the HTML file. If the tag is an anchor tag, the value of its href attribute is recorded.

Arguments:
- `tag`: name of the tag
- `attrs`: list of attributes
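A handle_starttag override that collects link targets could be sketched roughly as follows. `LinkCollector` and `links` are hypothetical names; the only behavior relied on from html.parser is that `attrs` arrives as a list of (name, value) pairs and that tag and attribute names are lowercased before the handler is called.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for Parser; `links` is an assumed attribute name."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are of interest here.
        if tag != "a":
            return
        for name, value in attrs:
            # An attribute without a value (e.g. <a href>) arrives as None.
            if name == "href" and value is not None:
                self.links.append(value)


collector = LinkCollector()
collector.feed('<a href="a.html">x</a><img src="y.png"><A HREF="b.html">z</A>')
collector.close()
print(collector.links)
```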

parse(self, path)
Reads the contents of a file and passes them to the parser.

Argument:
- `path`: path to the file
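One plausible shape for such a parse method is shown below. This is a sketch under assumptions, not the actual implementation: the class name `FileParser`, the `tags` attribute, the UTF-8 encoding, and the call to reset() before feeding are all choices made for this example.

```python
from html.parser import HTMLParser


class FileParser(HTMLParser):
    """Hypothetical stand-in for Parser that records start tag names."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def parse(self, path):
        # Assumption: the file is UTF-8 encoded.
        with open(path, encoding="utf-8") as handle:
            content = handle.read()
        # Drop any unprocessed data left over from a previous file.
        self.reset()
        self.feed(content)
        # Flush whatever the parser is still buffering.
        self.close()
```

Calling `FileParser().parse(FILE_PATH)` then mirrors the usage given in the class docstring.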

Methods inherited from html.parser.HTMLParser:

__init__(self, *, convert_charrefs=True)
Initialize and reset this instance. If convert_charrefs is True (the default), all character references are automatically converted to the corresponding Unicode characters.
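The effect of convert_charrefs can be observed by logging what reaches handle_data and handle_entityref in both modes. The class `CharrefLogger`, its attributes, and the helper `run` are hypothetical names used only for this sketch.

```python
from html.parser import HTMLParser


class CharrefLogger(HTMLParser):
    """Hypothetical helper that logs text chunks and entity references."""

    def __init__(self, *, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.chunks = []
        self.entities = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Only reached when convert_charrefs is False.
        self.entities.append(name)


def run(convert_charrefs):
    logger = CharrefLogger(convert_charrefs=convert_charrefs)
    logger.feed("<p>a &amp; b</p>")
    logger.close()
    return logger.chunks, logger.entities


converted = run(True)   # the reference is decoded inside the text
raw = run(False)        # the reference is reported separately
print(converted)
print(raw)
```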

check_for_whole_start_tag(self, i)
# Internal -- check to see if we have a complete starttag; return end or -1 if incomplete.

clear_cdata_mode(self)

close(self)
Handle any buffered data.

feed(self, data)
Feed data to the parser. Call this as often as you want, with as little or as much text as you want (may include '\n').
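As an illustration of incremental feeding, the sketch below splits the input in the middle of a start tag and in the middle of an end tag. The parser buffers the incomplete markup until the rest arrives. `EventLogger` and `events` are hypothetical names.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records start and end tag events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
# The chunk boundaries fall inside the markup on purpose.
for chunk in ('<a hre', 'f="x.html">link</', 'a>'):
    logger.feed(chunk)
logger.close()
print(logger.events)
```

Text content is a different matter: plain text may be handed to handle_data in several pieces when a chunk boundary falls inside it, so handlers should not assume one call per element.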

get_starttag_text(self)
Return full source of start tag: '<...>'.

goahead(self, end)
# Internal -- handle data as far as reasonable. May leave state and data to be processed by a subsequent call. If 'end' is true, force handling all data as if followed by EOF marker.

handle_charref(self, name)
# Overridable -- handle character reference

handle_comment(self, data)
# Overridable -- handle comment

handle_decl(self, decl)
# Overridable -- handle declaration

handle_endtag(self, tag)
# Overridable -- handle end tag

handle_entityref(self, name)
# Overridable -- handle entity reference

handle_pi(self, data)
# Overridable -- handle processing instruction

handle_startendtag(self, tag, attrs)
# Overridable -- finish processing of start+end tag: <tag.../>
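When handle_startendtag is not overridden, the default implementation calls handle_starttag followed by handle_endtag, so a self-closing tag looks like an empty element to the handlers. A small sketch, with `TagLogger` and `events` as hypothetical names:

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper that leaves handle_startendtag at its default."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = TagLogger()
logger.feed("<br/><p></p>")
logger.close()
print(logger.events)
```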

parse_bogus_comment(self, i, report=1)
# Internal -- parse bogus comment, return length or -1 if not terminated
# see http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state

parse_endtag(self, i)
# Internal -- parse endtag, return end or -1 if incomplete

parse_html_declaration(self, i)
# Internal -- parse html declarations, return length or -1 if not terminated
# See w3.org/TR/html5/tokenization.html#markup-declaration-open-state
# See also parse_declaration in _markupbase

parse_pi(self, i)
# Internal -- parse processing instr, return end or -1 if not terminated

parse_starttag(self, i)
# Internal -- handle starttag, return end or -1 if not terminated

reset(self)
Reset this instance.  Loses all unprocessed data.

set_cdata_mode(self, elem)

unescape(self, s)
# Internal -- helper to remove special character quoting
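Note that HTMLParser.unescape was deprecated in Python 3.4 and removed in Python 3.9, so it is only present on older interpreters such as the one this page was generated with. The public replacement is html.unescape, which can be sketched as follows (the sample string is made up for illustration):

```python
import html

# Named and numeric character references are both decoded.
decoded = html.unescape("&lt;p&gt; Tom &amp; Jerry &#169; &quot;2020&quot;")
print(decoded)
```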

unknown_decl(self, data)

Data and other attributes inherited from html.parser.HTMLParser:

CDATA_CONTENT_ELEMENTS = ('script', 'style')

Methods inherited from _markupbase.ParserBase:

getpos(self)
Return current line number and offset.
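getpos returns a (line number, offset) tuple, with lines counted from 1 and offsets from 0. Calling it inside a handler gives the position at which the construct being handled starts. The sketch below records that position for every start tag; `PositionLogger` and `positions` are hypothetical names.

```python
from html.parser import HTMLParser


class PositionLogger(HTMLParser):
    """Hypothetical helper that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports (line number, column offset).
        self.positions.append((tag, self.getpos()))


logger = PositionLogger()
logger.feed("<p>\n  <a href='x'>")
logger.close()
print(logger.positions)
```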

parse_comment(self, i, report=1)
# Internal -- parse comment, return length or -1 if not terminated

parse_declaration(self, i)
# Internal -- parse declaration (for use by subclasses).

parse_marked_section(self, i, report=1)
# Internal -- parse a marked section
# Override this to handle MS-word extension syntax <![if word]>content<![endif]>

updatepos(self, i, j)
# Internal -- update line number and offset. This should be called for each piece of data exactly once, in order -- in other words the concatenation of all the input strings to this function should be exactly the entire input.

Data descriptors inherited from _markupbase.ParserBase:

__dict__

dictionary for instance variables (if defined)

__weakref__

list of weak references to the object (if defined)