org.dlese.dpc.util.HTMLParser

public class HTMLParser extends Object

The HTMLParser class contains methods for parsing an HTML document. These methods extract the text of the document, as well as the contents of Meta tags, Header (h1, h2, h3, ... h6) tags, the Title tag, all the links in the page, etc.

Example html document at http://www.abc.org (used to help explain the methods in this API):

ABC.ORG's MAIN PAGE

Welcome to ABC.ORG.

Hurricane season is here!

Whether directly affected or not, students can benefit from the engaging learning experiences these dramatic events can provide. Keep abreast of the current storm at the Tropical Prediction Center, where you can view advisories, maps and forecast tracks.

Middle school students can learn about hurricane science and safety with the Hurricane Strike module, while more advanced students can utilize the multimedia technology of the online meteorology guide Hurricanes.

One of ABC's newest collections, the NASA Scientific Visualization Studio, offers data, images and animations from previous Atlantic storms.
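
To make the example page concrete, the sketch below pairs a possible markup for it with the values the title, header, and Meta tag methods are documented to return. The HTML string is a hypothetical reconstruction (the `keywords` value is invented, and the header wording follows the example page), and the helpers `titleText`, `headerText`, and `metaContent` are regex stand-ins written for this illustration. They are not the implementation of HTMLParser, and joining headers with a single space is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExamplePageSketch {

    // Hypothetical markup for the example page. The "keywords" value is invented.
    public static final String EXAMPLE_HTML =
        "<html><head>"
        + "<title>ABC.ORG's MAIN PAGE</title>"
        + "<meta name=\"keywords\" content=\"hurricanes, weather\">"
        + "<meta name=\"organization\" content=\"ABC Program Center\">"
        + "</head><body>"
        + "<h1>Welcome to ABC.ORG.</h1>"
        + "<h2>Hurricane season is here!</h2>"
        + "<p>Whether directly affected or not, students can benefit.</p>"
        + "</body></html>";

    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern HEADER = Pattern.compile(
        "<h([1-6])\\b[^>]*>(.*?)</h\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Assumes the name attribute precedes content and both are double-quoted.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"([^\"]*)\"\\s+content=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Text inside the first title element, or "" when there is none. */
    public static String titleText(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    /** Text of all h1..h6 elements joined by single spaces, or "" when there are none. */
    public static String headerText(String html) {
        List<String> parts = new ArrayList<>();
        Matcher m = HEADER.matcher(html);
        while (m.find()) {
            // Drop any markup nested inside the header element.
            parts.add(m.group(2).replaceAll("<[^>]+>", "").trim());
        }
        return String.join(" ", parts);
    }

    /** Content of the first Meta tag with the given name, or "" when absent. */
    public static String metaContent(String html, String name) {
        Matcher m = META.matcher(html);
        while (m.find()) {
            if (m.group(1).equals(name)) {
                return m.group(2);
            }
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(titleText(EXAMPLE_HTML));
        System.out.println(headerText(EXAMPLE_HTML));
        System.out.println(metaContent(EXAMPLE_HTML, "organization"));
        System.out.println(metaContent(EXAMPLE_HTML, "xyz").isEmpty());
    }
}
```

Regular expressions are used here only to keep the sketch free of dependencies; they are not a robust way to parse arbitrary HTML.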

Author: Sonal Bhushan

Constructor Summary

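| Constructor | Description |
| --- | --- |
| HTMLParser(String resourcelocn) | Constructor of an HTMLParser object for a URL or the name of an HTML file |
| HTMLParser(String htmlcontent, String charset) | Constructor of an HTMLParser object for a String containing the HTML to be parsed |

Method Summary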

| Modifier and Type | Method | Description |
| --- | --- | --- |
| String[] | getAllLinks() | returns a String array of all the links in the html document. |
| String | getHeaderText() | returns all the text in the html page which is contained within header tags (h1 through h6). |
| String | getImgAlts() | returns a String containing all the text within the alt attribute of all the img tags in the html document. |
| String | getLinkTitles() | returns a String containing all the text within the title attribute of all the links in the html document. |
| String | getMetaTagContentByName(String name) | returns the content of the Meta tag whose name equals name. |
| String | getTitleText() | returns the title of the HTML page, i.e. the text enclosed by the title tag. |
| String | getWholeText() | returns the text of the whole html document, stripped of all the HTML tags. |
| boolean | hasMetaTagName(String name) | returns true if the html document contains a Meta tag with a name equal to name, otherwise returns false. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- HTMLParser
  
  public HTMLParser(String resourcelocn) throws org.htmlparser.util.ParserException
  
  Constructor of an HTMLParser object
  
  Parameters:
  
  resourcelocn - either a URL or the name of an HTML file
  
  Throws:
  
  org.htmlparser.util.ParserException
  
  e.g.:
  
      HTMLParser hp = new HTMLParser("http://www.dlese.org");
      HTMLParser hp2 = new HTMLParser("testthis.htm");
- HTMLParser
  
  public HTMLParser(String htmlcontent, String charset) throws org.htmlparser.util.ParserException
  
  Constructor of an HTMLParser object
  
  Parameters:
  
  htmlcontent - String containing the HTML to be parsed
  
  charset - if null, the default encoding is used
  
  Throws:
  
  org.htmlparser.util.ParserException
Method Details
- getHeaderText
  
  public String getHeaderText() throws org.htmlparser.util.ParserException
  
  returns all the text in the html page which is contained within header tags (h1 through h6). If none of these tags are present in the page, it returns an empty string. e.g.:
  
      HTMLParser hp = new HTMLParser("http://www.abc.org");
      System.out.println(hp.getHeaderText());
  
  This prints out the following:
  
      Welcome to ABC.ORG Hurricane season is here!
  
  Returns:
  
  text in the header tags in the html document
  
  Throws:
  
  org.htmlparser.util.ParserException
- getTitleText
  
  public String getTitleText() throws org.htmlparser.util.ParserException
  
  returns the title of the HTML page, i.e. the text enclosed by the title tag. If this tag is not present in the page, it returns an empty string. e.g.:
  
      HTMLParser hp = new HTMLParser("http://www.abc.org");
      System.out.println(hp.getTitleText());
  
  This prints out the following:
  
      ABC.ORG's MAIN PAGE
  
  Returns:
  
  text in the title tag(s) in the html doc.
  
  Throws:
  
  org.htmlparser.util.ParserException
- hasMetaTagName
  
  public boolean hasMetaTagName(String name) throws org.htmlparser.util.ParserException
  
  returns true if the html document contains a Meta tag with a name equal to name, otherwise returns false. e.g.:
  
      HTMLParser hp = new HTMLParser("http://www.abc.org");
      boolean containskeywords = hp.hasMetaTagName("keywords");
      boolean containsxyz = hp.hasMetaTagName("xyz");
  
  In this code, containskeywords will be true, and containsxyz will be false.
  
  Parameters:
  
  name - name of the Meta Tag
  
  Returns:
  
  true if this tag is present, false otherwise
  
  Throws:
  
  org.htmlparser.util.ParserException
- getMetaTagContentByName
  
  public String getMetaTagContentByName(String name) throws org.htmlparser.util.ParserException
  
  returns the content of the Meta tag whose name equals name. If such a tag does not exist, returns an empty string. e.g.:
  
      HTMLParser hp = new HTMLParser("http://www.abc.org");
      if (hp.hasMetaTagName("organization")) {
          System.out.println(hp.getMetaTagContentByName("organization"));
      }
  
  This prints out the following:
  
      ABC Program Center
  
  Parameters:
  
  name - name of the Meta Tag
  
  Returns:
  
  The value of this meta tag
  
  Throws:
  
  org.htmlparser.util.ParserException
- getAllLinks
  
  public String[] getAllLinks() throws org.htmlparser.util.ParserException
  
  returns a String array of all the links in the html document.
  
  Returns:
  
  a string array of all the links
  
  Throws:
  
  org.htmlparser.util.ParserException
- getLinkTitles
  
  public String getLinkTitles() throws org.htmlparser.util.ParserException
  
  returns a String containing all the text within the title attribute of all the links in the html document
  
  Returns:
  
  all the text within the title attribute of all the links in the doc.
  
  Throws:
  
  org.htmlparser.util.ParserException
- getImgAlts
  
  public String getImgAlts() throws org.htmlparser.util.ParserException
  
  returns a String containing all the text within the alt attribute of all the img tags in the html document
  
  Returns:
  
  all the text within the alt attribute of all the img tags in the html doc
  
  Throws:
  
  org.htmlparser.util.ParserException
- getWholeText
  
  public String getWholeText() throws org.htmlparser.util.ParserException
  
  returns the text of the whole html document, stripped of all the HTML tags. This text also includes the text within the alt attribute of all the img tags, as well as the text within the title attribute of all the link tags.
  
  Returns:
  
  the text of the whole html document, without the HTML tags
  
  Throws:
  
  org.htmlparser.util.ParserException
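
For the methods documented without an example (getAllLinks, getLinkTitles, getImgAlts, getWholeText), the sketch below illustrates the kind of output their descriptions suggest. It is a regex stand-in that uses only the standard library, not the implementation of HTMLParser. The sample markup and its URLs, the helper names, the single-space separator, and the order in which alt and title text is appended to the whole text are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkAndTextSketch {

    // Hypothetical fragment; the URLs and attribute values are invented.
    public static final String SAMPLE =
        "<p>View the <a href=\"http://www.abc.org/maps\" title=\"Storm maps\">maps</a> "
        + "and <a href=\"http://www.abc.org/safety\">safety tips</a>."
        + "<img src=\"track.png\" alt=\"Forecast track\"></p>";

    /** Values of one double-quoted attribute, taken from every occurrence of one tag. */
    public static List<String> attributeValues(String html, String tag, String attribute) {
        Pattern p = Pattern.compile(
            "<" + Pattern.quote(tag) + "\\b[^>]*?\\b" + Pattern.quote(attribute)
                + "\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);
        List<String> values = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    /** href value of every a tag, in document order. */
    public static String[] allLinks(String html) {
        return attributeValues(html, "a", "href").toArray(new String[0]);
    }

    /** title attribute text of every a tag, joined by spaces. */
    public static String linkTitles(String html) {
        return String.join(" ", attributeValues(html, "a", "title"));
    }

    /** alt attribute text of every img tag, joined by spaces. */
    public static String imgAlts(String html) {
        return String.join(" ", attributeValues(html, "img", "alt"));
    }

    /** Tag-free text followed by the img alt text and the link title text. */
    public static String wholeText(String html) {
        String visible = html.replaceAll("<[^>]+>", "");
        String combined = visible + " " + imgAlts(html) + " " + linkTitles(html);
        // Collapse runs of whitespace left behind by empty parts.
        return combined.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        for (String link : allLinks(SAMPLE)) {
            System.out.println(link);
        }
        System.out.println(linkTitles(SAMPLE));
        System.out.println(imgAlts(SAMPLE));
        System.out.println(wholeText(SAMPLE));
    }
}
```

With the real class, the equivalent calls would be made on an HTMLParser instance and would need to handle org.htmlparser.util.ParserException.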
