Thursday, September 19, 2013

Getting Alt Text from IMG tags in HTML using Pattern Regex - Java

I was given the task of retrieving the value of the alt attribute from an img tag inside an HTML string. Here is an example:

<div class="rc_release_list_item_picture">
 <a href="/release/fatboy-slim-and-riva-starr/21683/eat-sleep-rave-repeat/"><img width="87" height="87" border="0" src="" alt="Fatboy Slim and Riva Starr - Eat, Sleep, Rave, Repeat" title="Fatboy Slim and Riva Starr - Eat, Sleep, Rave, Repeat" /></a>

I used the Pattern and Matcher classes provided by the java.util.regex package. The following is a quick main method I built to achieve my task:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * User: GKproggy
 */
public class Main {
    private static String testStrings = "<div class=\"rc_release_list_item_picture\">\n" +
            " <a href=\"/release/fatboy-slim-and-riva-starr/21683/eat-sleep-rave-repeat/\"><img width=\"87\" height=\"87\" border=\"0\" src=\"\" alt=\"Fatboy Slim and Riva Starr - Eat, Sleep, Rave, Repeat\" title=\"Fatboy Slim and Riva Starr - Eat, Sleep, Rave, Repeat\" /></a>\n";

    public static void main(String[] args) {
        // Group 1 captures the alt text: word characters, whitespace and - . : / ,
        String regexPattern = "<img[^>]*alt=[\"]*([\\w\\s-.:\\/,]+)[\"]*[^>]*/>";
        Pattern p = Pattern.compile(regexPattern);
        Matcher m = p.matcher(testStrings);
        // Print the captured alt text of every img tag found in the string
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}
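
The character class in the pattern above only allows word characters, whitespace and a few punctuation marks, and it expects the tag to end with />. An alt text containing, say, an exclamation mark or an apostrophe would be cut short or missed. Here is a minimal sketch of a more permissive variant, assuming the attribute value is always wrapped in single or double quotes. The class name AltTextExtractor and the method extractAltTexts are hypothetical names I picked for illustration, not part of any library:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class AltTextExtractor {

    // Matches an img tag with a quoted alt attribute.
    // Group 1 is the opening quote (" or '), group 2 is the alt text,
    // and \1 requires the same quote character to close the value.
    // The \s before "alt" avoids matching attributes such as data-alt.
    private static final Pattern IMG_ALT = Pattern.compile(
            "<img\\b[^>]*?\\salt\\s*=\\s*([\"'])(.*?)\\1[^>]*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the alt text of every matching img tag, in document order.
    static List<String> extractAltTexts(String html) {
        List<String> alts = new ArrayList<>();
        Matcher m = IMG_ALT.matcher(html);
        while (m.find()) {
            alts.add(m.group(2));
        }
        return alts;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/x\"><img src=\"a.png\" alt=\"First image\" /></a>"
                + "<IMG alt='Second: image!' src='b.png'>";
        for (String alt : extractAltTexts(html)) {
            System.out.println(alt);
        }
    }
}
```

Running main should print "First image" and then "Second: image!". This is still only a regex, so unquoted attribute values or unusual markup will slip through. For anything beyond quick scraping, a real HTML parser is likely the safer choice.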


Hope this gives you a quick idea of how to use regex to retrieve the info you want from raw HTML.
