Thursday, September 19, 2013

Getting Alt Text from IMG tags in HTML using Pattern Regex - Java

I was given the task of retrieving the value of the alt attribute from an img tag inside an HTML string. Here is an example:

<div class="rc_release_list_item_picture">
 <a href="/release/fatboy-slim-and-riva-starr/21683/eat-sleep-rave-repeat/"><img width="87" height="87" border="0" src="" alt="Fatboy Slim and Riva Starr - Eat, Sleep, Rave, Repeat" title="Fatboy Slim and Riva Starr - Eat, Sleep, Rave, Repeat" /></a>

I used the Pattern and Matcher classes provided by the java.util.regex package. The following is a quick main method I built to achieve my task:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * User: GKproggy
 */
public class Main {
    private static String testStrings = "<div class=\"rc_release_list_item_picture\">\n" +
            " <a href=\"/release/fatboy-slim-and-riva-starr/21683/eat-sleep-rave-repeat/\"><img width=\"87\" height=\"87\" border=\"0\" src=\"\" alt=\"Fatboy Slim and Riva Starr - Eat, Sleep, Rave, Repeat\" title=\"Fatboy Slim and Riva Starr - Eat, Sleep, Rave, Repeat\" /></a>\n";

    public static void main(String[] args) {
        // Group 1 captures the alt text: word characters, whitespace, and - . : / ,
        String regexPattern = "<img[^>]*alt=[\"]*([\\w\\s-.:\\/,]+)[\"]*[^>]*/>";
        Pattern p = Pattern.compile(regexPattern);
        Matcher m = p.matcher(testStrings);
        // Print the captured alt text of every img tag that matches
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}
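
The pattern above only captures alt text made of word characters, whitespace, and the characters - . : / and comma, and it only matches tags that end in />. As a rough sketch of a more tolerant variant (not the code from my task), the version below captures everything between the double quotes and accepts tags closed with either > or />. The class name AltTextExtractor and the method extractAltTexts are hypothetical names chosen for illustration, and the sketch assumes alt values are always double-quoted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative and not from the original task.
class AltTextExtractor {

    // Matches an <img ...> tag and captures a double-quoted alt value in group 1.
    // [^>]* skips over other attributes, the \s before "alt" requires whitespace
    // ahead of the attribute name (so data-alt is not picked up), and [^"]*
    // accepts any character up to the closing quote.
    private static final Pattern IMG_ALT = Pattern.compile(
            "<img\\b[^>]*\\salt\\s*=\\s*\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the alt text of every matching img tag, in the order the tags appear.
    static List<String> extractAltTexts(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = IMG_ALT.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input: one self-closed tag and one upper-case tag closed with ">"
        String html = "<a href=\"/one\"><img src=\"\" alt=\"First - cover\" /></a>"
                + "<a href=\"/two\"><IMG alt=\"Second: cover (2013)\" src=\"\"></a>";
        for (String alt : extractAltTexts(html)) {
            System.out.println(alt);
        }
    }
}
```

Using [^"]* instead of a fixed list of allowed characters means parentheses, apostrophes, and ampersands in the alt text are captured as well. The trade-off is that unquoted or single-quoted alt values are not matched by this sketch.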


Hope this gives you a quick idea of how to use regex to retrieve the info you want from raw HTML.


