Class HTMLParser

java.lang.Object
org.dlese.dpc.util.HTMLParser

public class HTMLParser extends Object
The HTMLParser class contains methods for parsing an HTML document. These methods extract the text of the document, as well as the contents of Meta tags, header tags (h1, h2, h3, ... h6), the Title tag, all the links in the page, etc.

Example HTML document at http://www.abc.org (used to explain the methods in this API):

ABC.ORG's MAIN PAGE

Welcome to ABC.ORG.

Hurricane season is here!

abc logo
Whether directly affected or not, students can benefit from the engaging learning experiences these dramatic events can provide. Keep abreast of the current storm at the Tropical Prediction Center, where you can view advisories, maps and forecast tracks.

Middle school students can learn about hurricane science and safety with the Hurricane Strike module, while more advanced students can utilize the multimedia technology of the online meteorology guide Hurricanes.

One of ABC's newest collections, the NASA Scientific Visualization Studio, offers data, images and animations from previous Atlantic storms.

Author:
Sonal Bhushan
  • Constructor Details

    • HTMLParser

      public HTMLParser(String resourcelocn) throws org.htmlparser.util.ParserException
      Constructor of an HTMLParser object. e.g.:
        HTMLParser hp = new HTMLParser("http://www.dlese.org");
        HTMLParser hp2 = new HTMLParser("testthis.htm");
      Parameters:
      resourcelocn - either a URL or the name of an HTML file
      Throws:
      org.htmlparser.util.ParserException
    • HTMLParser

      public HTMLParser(String htmlcontent, String charset) throws org.htmlparser.util.ParserException
      Constructor of an HTMLParser object
      Parameters:
      htmlcontent - String containing the HTML to be parsed
      charset - if null, the default encoding is used
      Throws:
      org.htmlparser.util.ParserException
  • Method Details

    • getHeaderText

      public String getHeaderText() throws org.htmlparser.util.ParserException
      returns all the text in the HTML page that is contained within header tags (h1 through h6). If none of these tags are present in the page, it returns an empty string. e.g.:
        HTMLParser hp = new HTMLParser("http://www.abc.org");
        System.out.println(hp.getHeaderText());
      This prints out the following:
        Welcome to ABC.ORG Hurricane season is here!
      Returns:
      text in the header tags in the html document
      Throws:
      org.htmlparser.util.ParserException
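To make the expected result of getHeaderText concrete, here is a minimal standard-library sketch of the same idea: collect the text inside h1 through h6 elements. It is a stand-in for illustration only, not the DLESE implementation (which is built on org.htmlparser). The class name HeaderTextSketch, the method headerText, and the choice to join headers with a single space are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only: NOT the org.dlese.dpc.util.HTMLParser implementation.
final class HeaderTextSketch {
    // Matches <h1>...</h1> through <h6>...</h6>; the backreference \1 makes the
    // closing tag use the same level as the opening tag.
    private static final Pattern HEADER = Pattern.compile(
            "<h([1-6])\\b[^>]*>(.*?)</h\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of all header elements joined by single spaces (assumed
    // separator), or an empty string if the page has no header tags.
    static String headerText(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = HEADER.matcher(html);
        while (m.find()) {
            // Remove markup nested inside the header and collapse whitespace.
            String text = m.group(2)
                    .replaceAll("<[^>]*>", "")
                    .replaceAll("\\s+", " ")
                    .trim();
            if (text.isEmpty()) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(text);
        }
        return out.toString();
    }
}
```

A regular expression is enough for a short illustration like this, but it is not a robust way to parse arbitrary HTML.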
    • getTitleText

      public String getTitleText() throws org.htmlparser.util.ParserException
      returns the title of the HTML page, i.e. the text enclosed by the title tag. If this tag is not present in the page, it returns an empty string. e.g.:
        HTMLParser hp = new HTMLParser("http://www.abc.org");
        System.out.println(hp.getTitleText());
      This prints out the following:
        ABC.ORG's MAIN PAGE
      Returns:
      text in the title tag(s) in the html doc.
      Throws:
      org.htmlparser.util.ParserException
    • hasMetaTagName

      public boolean hasMetaTagName(String name) throws org.htmlparser.util.ParserException
      returns true if the HTML document contains a Meta tag with a name equal to name, otherwise returns false. e.g.:
        HTMLParser hp = new HTMLParser("http://www.abc.org");
        boolean containskeywords = hp.hasMetaTagName("keywords");
        boolean containsxyz = hp.hasMetaTagName("xyz");
      In this code, containskeywords will be true, and containsxyz will be false.
      Parameters:
      name - name of the Meta Tag
      Returns:
      true if a Meta tag with this name is present, false otherwise
      Throws:
      org.htmlparser.util.ParserException
    • getMetaTagContentByName

      public String getMetaTagContentByName(String name) throws org.htmlparser.util.ParserException
      returns the content of the Meta tag whose name equals name. If such a tag does not exist, returns an empty string. e.g.:
        HTMLParser hp = new HTMLParser("http://www.abc.org");
        if (hp.hasMetaTagName("organization")) {
          System.out.println(hp.getMetaTagContentByName("organization"));
        }
      This prints out the following:
        ABC Program Center
      Parameters:
      name - name of the Meta Tag
      Returns:
      The value of this meta tag
      Throws:
      org.htmlparser.util.ParserException
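The contract shared by hasMetaTagName and getMetaTagContentByName (a presence check, plus an empty string when the tag is missing) can be sketched with the standard library alone. This is a hedged illustration, not the library's code: the class MetaTagSketch, its method names, and the exact, case-sensitive comparison of the name attribute are assumptions. The sketch also only handles quoted attribute values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only: NOT the org.dlese.dpc.util.HTMLParser implementation.
final class MetaTagSketch {
    // Captures the attribute section of every <meta ...> tag.
    private static final Pattern META =
            Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Reads a single- or double-quoted attribute value, or returns null if absent.
    private static String attr(String attrs, String key) {
        Matcher m = Pattern.compile(
                "\\s" + Pattern.quote(key) + "\\s*=\\s*([\"'])(.*?)\\1",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(attrs);
        return m.find() ? m.group(2) : null;
    }

    // Returns the content of the first meta tag with the given name,
    // or null when no such tag exists.
    private static String findContent(String html, String name) {
        Matcher m = META.matcher(html);
        while (m.find()) {
            String found = attr(m.group(1), "name");
            if (name.equals(found)) { // assumption: exact, case-sensitive match
                String content = attr(m.group(1), "content");
                return content == null ? "" : content;
            }
        }
        return null;
    }

    static boolean hasMetaTagName(String html, String name) {
        return findContent(html, name) != null;
    }

    // Mirrors the documented behaviour: empty string when the tag is missing.
    static String metaContentByName(String html, String name) {
        String content = findContent(html, name);
        return content == null ? "" : content;
    }
}
```

Because the content lookup already returns an empty string for a missing tag, the presence check matters mainly when a caller needs to tell "absent" apart from "present but empty".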
    • getAllLinks

      public String[] getAllLinks() throws org.htmlparser.util.ParserException
      returns a String array of all the links in the HTML document.
      Returns:
      a string array of all the links
      Throws:
      org.htmlparser.util.ParserException
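The entry above gives no example for getAllLinks, so the following standard-library sketch shows one plausible reading of "all the links": the href values of the anchor tags, in document order. The class LinkSketch and method allLinks are hypothetical names. Whether the real method resolves relative links to absolute URLs is not stated in this documentation; the sketch returns hrefs exactly as written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only: NOT the org.dlese.dpc.util.HTMLParser implementation.
final class LinkSketch {
    // Finds the quoted href attribute of each <a ...> tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href values in document order; empty array if there are none.
    static String[] allLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links.toArray(new String[0]);
    }
}
```

With the real class, the returned array would typically be consumed in a for-each loop, for example to print one link per line.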
    • getLinkTitles

      public String getLinkTitles() throws org.htmlparser.util.ParserException
      returns a String containing the text of the title attribute of every link in the HTML document
      Returns:
      all the text within the title attribute of all the links in the doc.
      Throws:
      org.htmlparser.util.ParserException
    • getImgAlts

      public String getImgAlts() throws org.htmlparser.util.ParserException
      returns a String containing the text of the alt attribute of every img tag in the HTML document
      Returns:
      all the text within the alt attribute of all the img tags in the HTML doc
      Throws:
      org.htmlparser.util.ParserException
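getLinkTitles and getImgAlts both gather one attribute from every occurrence of one tag and return the result as a single String. The sketch below illustrates that pattern with the standard library. The class AttributeTextSketch and its methods are hypothetical. The documentation does not say how the individual values are separated, so the single-space separator is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only: NOT the org.dlese.dpc.util.HTMLParser implementation.
final class AttributeTextSketch {
    // Collects the quoted values of `attribute` on every `tag` element,
    // joined by single spaces (assumed separator).
    static String attributeText(String html, String tag, String attribute) {
        Pattern p = Pattern.compile(
                "<" + Pattern.quote(tag) + "\\b[^>]*?\\s" + Pattern.quote(attribute)
                        + "\\s*=\\s*([\"'])(.*?)\\1",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = m.group(2).trim();
            if (value.isEmpty()) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(value);
        }
        return out.toString();
    }

    // title attributes of <a> tags, in document order.
    static String linkTitles(String html) {
        return attributeText(html, "a", "title");
    }

    // alt attributes of <img> tags, in document order.
    static String imgAlts(String html) {
        return attributeText(html, "img", "alt");
    }
}
```

Tags that lack the attribute are skipped, so a page with no titled links yields an empty string in this sketch.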
    • getWholeText

      public String getWholeText() throws org.htmlparser.util.ParserException
      returns the text of the whole HTML document, stripped of all the HTML tags. This text also includes the text within the alt attribute of all the img tags, as well as the text within the title attribute of all the link tags.
      Returns:
      the text of the whole document, with all HTML tags removed
      Throws:
      org.htmlparser.util.ParserException
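As a rough illustration of what getWholeText describes, the standard-library sketch below strips the tags from a page while keeping img alt text and link title text. It is not the library's implementation. The class WholeTextSketch is hypothetical, and placing each alt or title value at the position of its tag is an assumption, since the documentation does not say where that text appears in the result.

```java
import java.util.regex.Pattern;

// Illustrative stand-in only: NOT the org.dlese.dpc.util.HTMLParser implementation.
final class WholeTextSketch {
    // An <img> tag carrying a quoted alt attribute; group 2 is the alt text.
    private static final Pattern IMG_ALT = Pattern.compile(
            "<img\\b[^>]*?\\salt\\s*=\\s*([\"'])(.*?)\\1[^>]*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // An opening <a> tag carrying a quoted title attribute; group 2 is the title.
    private static final Pattern A_TITLE = Pattern.compile(
            "<a\\b[^>]*?\\stitle\\s*=\\s*([\"'])(.*?)\\1[^>]*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String wholeText(String html) {
        // Substitute attribute text for the tags that carry it (assumed placement).
        String s = IMG_ALT.matcher(html).replaceAll(" $2 ");
        s = A_TITLE.matcher(s).replaceAll(" $2 ");
        // Replace every remaining tag with a space, then normalise whitespace.
        s = s.replaceAll("<[^>]*>", " ");
        return s.replaceAll("\\s+", " ").trim();
    }
}
```

The sketch does not decode character entities or skip script and style content, both of which a real parser would need to handle.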