Package org.dlese.dpc.util
Class HTMLParser
java.lang.Object
org.dlese.dpc.util.HTMLParser
The HTMLParser class contains methods for parsing an HTML document. These
methods extract the text of the document, as well as the contents of Meta tags,
header (h1, h2, h3, ... h6) tags, the Title tag, all the links in the page, and so on.
Example HTML document at http://www.abc.org (used to explain the methods in this API):
ABC.ORG's MAIN PAGE
Whether directly affected or not, students can benefit from the engaging learning experiences these dramatic events can provide. Keep abreast of the current storm at the Tropical Prediction Center, where you can view advisories, maps and forecast tracks.
Welcome to ABC.ORG.
Hurricane season is here!
Whether directly affected or not, students can benefit from the engaging learning experiences these dramatic events can provide. Keep abreast of the current storm at the Tropical Prediction Center, where you can view advisories, maps and forecast tracks.
Middle school students can learn about hurricane science and safety with the Hurricane Strike module, while more advanced students can utilize the multimedia technology of the online meteorology guide Hurricanes.
One of ABC's newest collections, the NASA Scientific Visualization Studio, offers data, images and animations from previous Atlantic storms.
- Author:
- Sonal Bhushan
-
Constructor Summary
Constructors
Constructor / Description
HTMLParser(String resourcelocn): Constructor of an HTMLParser object
HTMLParser(String htmlcontent, String charset): Constructor of an HTMLParser object
-
Method Summary
Modifier and Type / Method / Description
String[] getAllLinks(): returns a String array of all the links in the HTML document.
getHeaderText(): returns all the text in the HTML page which is contained within header tags (h1 through h6).
getImgAlts(): returns a String containing all the text within the alt attribute of all the img tags in the HTML document.
getLinkTitles(): returns a String containing all the text within the title attribute of all the links in the HTML document.
getMetaTagContentByName(String name): returns the content of the Meta tag whose name equals name.
getTitleText(): returns the title of the HTML page.
getWholeText(): returns the text of the whole HTML document, stripped of all the HTML tags.
boolean hasMetaTagName(String name): returns true if the HTML document contains a Meta tag with a name equal to name, otherwise returns false.
-
Constructor Details
-
HTMLParser
Constructor of an HTMLParser object, e.g.:
HTMLParser hp = new HTMLParser("http://www.dlese.org");
HTMLParser hp2 = new HTMLParser("testthis.htm");
- Parameters:
resourcelocn - either a URL or the name of an HTML file
- Throws:
org.htmlparser.util.ParserException
-
HTMLParser
Constructor of an HTMLParser object
- Parameters:
htmlcontent - String containing the HTML to be parsed
charset - if null, the default encoding is used
- Throws:
org.htmlparser.util.ParserException
-
-
Method Details
-
getHeaderText
returns all the text in the HTML page which is contained within header tags (h1, h2, h3, ... h6). If none of these tags are present in the page, it returns an empty string. e.g.:
HTMLParser hp = new HTMLParser("http://www.abc.org");
System.out.println(hp.getHeaderText());
This prints out the following:
Welcome to ABC.ORG Hurricane season is here!
- Returns:
- text in the header tags in the html document
- Throws:
org.htmlparser.util.ParserException
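The behaviour described for getHeaderText can be illustrated with a small standalone sketch. This is not the HTMLParser implementation; it is a regex-based approximation that uses only the Java standard library. The class name HeaderTextSketch, the method headerText, and the choice of a single space as the separator between headers are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.dlese.dpc.util.
final class HeaderTextSketch {
    // Matches <h1>...</h1> through <h6>...</h6>; \1 forces the closing level to match the opening one.
    private static final Pattern HEADER = Pattern.compile(
        "<h([1-6])\\b[^>]*>(.*?)</h\\1\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Matches any remaining tag nested inside a header, e.g. <b> or <a ...>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the text of all header tags joined by a space (assumed separator),
    // or an empty string when the page has no header tags.
    static String headerText(String html) {
        List<String> parts = new ArrayList<>();
        Matcher m = HEADER.matcher(html);
        while (m.find()) {
            String text = TAG.matcher(m.group(2)).replaceAll("").trim();
            if (!text.isEmpty()) {
                parts.add(text);
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Welcome to ABC.ORG</h1><p>Intro</p>"
                    + "<h2>Hurricane season is here!</h2></body></html>";
        System.out.println(headerText(html));
    }
}
```

For the two headers in the sample string, the sketch prints the same line as the example above. A real parser handles malformed markup far more robustly than a regular expression can.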
-
getTitleText
returns the title of the HTML page, i.e. the text enclosed by the title tag. If this tag is not present in the page, it returns an empty string. e.g.:
HTMLParser hp = new HTMLParser("http://www.abc.org");
System.out.println(hp.getTitleText());
This prints out the following:
ABC.ORG's MAIN PAGE
- Returns:
- text in the title tag(s) in the html doc.
- Throws:
org.htmlparser.util.ParserException
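The contract of getTitleText (title text if present, empty string otherwise) can be sketched as follows. The sketch is a standard-library approximation and not the class's actual code. TitleTextSketch and titleText are hypothetical names, and returning only the first title tag is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.dlese.dpc.util.
final class TitleTextSketch {
    private static final Pattern TITLE = Pattern.compile(
        "<title\\b[^>]*>(.*?)</title\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed text of the first title tag (assumption),
    // or an empty string when no title tag is present.
    static String titleText(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>ABC.ORG's MAIN PAGE</title></head><body></body></html>";
        System.out.println(titleText(html));
    }
}
```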
-
hasMetaTagName
returns true if the HTML document contains a Meta tag with a name equal to name, otherwise returns false. e.g.:
HTMLParser hp = new HTMLParser("http://www.abc.org");
boolean containskeywords = hp.hasMetaTagName("keywords");
boolean containsxyz = hp.hasMetaTagName("xyz");
In this code, containskeywords will be true, and containsxyz will be false.
- Parameters:
name - name of the Meta tag
- Returns:
- true if this tag is present, false otherwise
- Throws:
org.htmlparser.util.ParserException
-
getMetaTagContentByName
returns the content of the Meta tag whose name equals name. If such a tag does not exist, returns an empty string. e.g.:
HTMLParser hp = new HTMLParser("http://www.abc.org");
if (hp.hasMetaTagName("organization")) { System.out.println(hp.getMetaTagContentByName("organization")); }
This prints out the following:
ABC Program Center
- Parameters:
name - name of the Meta tag
- Returns:
- The value of this meta tag
- Throws:
org.htmlparser.util.ParserException
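The two Meta tag methods, hasMetaTagName and getMetaTagContentByName, amount to a lookup from a tag's name attribute to its content attribute. The sketch below shows that lookup with the standard library only. MetaTagSketch and its methods are hypothetical names, exact (case-sensitive) name matching is an assumption, and the keywords content in the sample is invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.dlese.dpc.util.
final class MetaTagSketch {
    private static final Pattern META =
        Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Only double-quoted attribute values are handled in this sketch.
    private static final Pattern ATTR =
        Pattern.compile("([\\w-]+)\\s*=\\s*\"([^\"]*)\"");

    // Builds a name -> content map from every meta tag that has a name attribute.
    static Map<String, String> metaTags(String html) {
        Map<String, String> tags = new LinkedHashMap<>();
        Matcher meta = META.matcher(html);
        while (meta.find()) {
            String name = null;
            String content = "";
            Matcher attr = ATTR.matcher(meta.group(1));
            while (attr.find()) {
                if (attr.group(1).equalsIgnoreCase("name")) {
                    name = attr.group(2);
                } else if (attr.group(1).equalsIgnoreCase("content")) {
                    content = attr.group(2);
                }
            }
            if (name != null) {
                tags.put(name, content);
            }
        }
        return tags;
    }

    // Exact, case-sensitive name comparison (assumption).
    static boolean hasMetaTagName(String html, String name) {
        return metaTags(html).containsKey(name);
    }

    // Returns an empty string when no meta tag has the given name.
    static String metaTagContentByName(String html, String name) {
        return metaTags(html).getOrDefault(name, "");
    }

    public static void main(String[] args) {
        String html = "<meta name=\"keywords\" content=\"hurricanes, weather\">"
                    + "<meta name=\"organization\" content=\"ABC Program Center\">";
        System.out.println(hasMetaTagName(html, "keywords"));
        System.out.println(metaTagContentByName(html, "organization"));
    }
}
```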
-
getAllLinks
returns a String array of all the links in the HTML document.
- Returns:
- a string array of all the links
- Throws:
org.htmlparser.util.ParserException
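As a rough picture of what collecting "all the links" into a String array involves, the sketch below gathers the href values of anchor tags in document order. It is a standard-library approximation, not the HTMLParser code. AllLinksSketch and allLinks are hypothetical names, and the sketch returns raw href values; whether the real method resolves relative links against the page URL is not stated in this documentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.dlese.dpc.util.
final class AllLinksSketch {
    // Captures the double-quoted href value of each <a ...> tag.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the href values in document order; an empty array when there are no links.
    static String[] allLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.abc.org/a.html\">A</a> "
                    + "<a title=\"t\" href=\"b.html\">B</a>";
        for (String link : allLinks(html)) {
            System.out.println(link);
        }
    }
}
```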
-
getLinkTitles
returns a String containing all the text within the title attribute of all the links in the HTML document.
- Returns:
- all the text within the title attribute of all the links in the doc.
- Throws:
org.htmlparser.util.ParserException
-
getImgAlts
returns a String containing all the text within the alt attribute of all the img tags in the HTML document.
- Returns:
- all the text within the alt attribute of all the img tags in the html doc
- Throws:
org.htmlparser.util.ParserException
-
getWholeText
returns the text of the whole HTML document, stripped of all the HTML tags. This text also includes the text within the alt attribute of all the img tags, as well as the text within the title attribute of all the link tags.
- Returns:
- The wholeText value
- Throws:
org.htmlparser.util.ParserException
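The combination described for getWholeText (tag-stripped text plus img alt text plus link title text) can be sketched with the standard library as below. This is an approximation for illustration, not the class's implementation. WholeTextSketch and wholeText are hypothetical names, and the ordering (body text first, then alt text, then link titles) and the whitespace collapsing are assumptions, since the documentation does not specify either.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.dlese.dpc.util.
final class WholeTextSketch {
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern IMG_ALT = Pattern.compile(
        "<img\\b[^>]*?\\balt\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern LINK_TITLE = Pattern.compile(
        "<a\\b[^>]*?\\btitle\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Strips tags, then appends img alt text and link title text (assumed order),
    // and finally collapses runs of whitespace into single spaces.
    static String wholeText(String html) {
        StringBuilder sb = new StringBuilder(TAG.matcher(html).replaceAll(" "));
        appendAll(sb, IMG_ALT, html);
        appendAll(sb, LINK_TITLE, html);
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    // Appends the first capture group of every match, each preceded by a space.
    private static void appendAll(StringBuilder sb, Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            sb.append(' ').append(m.group(1));
        }
    }

    public static void main(String[] args) {
        String html = "<p>Hurricane <b>season</b></p>"
                    + "<img src=\"s.png\" alt=\"storm map\">"
                    + "<a href=\"x\" title=\"forecast tracks\">link</a>";
        System.out.println(wholeText(html));
    }
}
```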
-