Thursday, September 19, 2013

Getting Alt Text from IMG tags in HTML using Pattern Regex - Java

I was given the task of retrieving the alt text from an img tag inside an HTML string. Here is an example:


<div class="rc_release_list_item_picture">
 <a href="/release/fatboy-slim-and-riva-starr/21683/eat-sleep-rave-repeat/"><img width="87" height="87" border="0" src="http://n.image.weareone.fm/news/_newsgrafiken/2013/_releases/lames/24-07-2013--fatboy-slim-and-riva-starr-eat-sleep-rave-repeat_s.png" alt="Fatboy Slim and Riva Starr - Eat, Sleep, Rave, Repeat" title="Fatboy Slim and Riva Starr - Eat, Sleep, Rave, Repeat" /></a>
</div>


I used the Pattern and Matcher classes provided by the java.util.regex package. The following is a quick main method I wrote to accomplish the task:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * User: GKproggy
 */
public class Main
{
    private static String testStrings = "<div class=\"rc_release_list_item_picture\">\n" +
            " <a href=\"/release/fatboy-slim-and-riva-starr/21683/eat-sleep-rave-repeat/\"><img width=\"87\" height=\"87\" border=\"0\" src=\"http://n.image.weareone.fm/news/_newsgrafiken/2013/_releases/lames/24-07-2013--fatboy-slim-and-riva-starr-eat-sleep-rave-repeat_s.png\" alt=\"Fatboy Slim and Riva Starr - Eat, Sleep, Rave, Repeat\" title=\"Fatboy Slim and Riva Starr - Eat, Sleep, Rave, Repeat\" /></a>\n" +
            "</div>";
    public static void main(String[] args)
    {
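        // Group 1 captures the alt value. The character class only allows word characters,
        // whitespace and . : / , - so any other character (e.g. an apostrophe) cuts the
        // capture short. The tag must also be self-closing (end in "/>").
        String regexPattern = "<img[^>]*alt=[\"]*([\\w\\s.:/,-]+)[\"]*[^>]*/>";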
        Pattern p = Pattern.compile(regexPattern);
        Matcher m = p.matcher(testStrings);
        if(m.find())
        {
            System.out.println(m.group(1));
        }

    }
}
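
The main method above prints only the first match and accepts a limited set of characters in the alt text. As a rough sketch of a more tolerant variant, the class below collects the alt text of every img tag in a string. The class name AltTextExtractor and the method extractAltTexts are names I made up for this illustration, and the pattern assumes that alt values are always wrapped in double quotes and that no attribute value contains a ">" character. For HTML you do not control, a real HTML parser is likely the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustrative sketch only: class and method names are hypothetical.
 */
public class AltTextExtractor
{
    // Assumption: alt values are double-quoted and contain no escaped quotes.
    // "\\salt=" requires whitespace before "alt", so attributes such as data-alt are skipped.
    private static final Pattern IMG_ALT = Pattern.compile(
            "<img\\b[^>]*?\\salt=\"([^\"]*)\"[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the alt text of every img tag found in the given HTML, in document order. */
    public static List<String> extractAltTexts(String html)
    {
        List<String> altTexts = new ArrayList<>();
        Matcher m = IMG_ALT.matcher(html);
        // find() advances to the next match on each call, so loop until it returns false.
        while (m.find())
        {
            altTexts.add(m.group(1));
        }
        return altTexts;
    }

    public static void main(String[] args)
    {
        String html = "<img src=\"a.png\" alt=\"First image\" />"
                + "<p>text</p>"
                + "<img alt=\"Second: it's here!\" src=\"b.png\">";
        for (String alt : extractAltTexts(html))
        {
            System.out.println(alt);
        }
    }
}
```

Because the capture group is [^\"]*, it accepts any character up to the closing quote, including an empty alt value, and the trailing [^>]*> matches both self-closing and plain img tags.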


I hope this gives you a quick idea of how to use regex to retrieve the information you want from raw HTML.