LOOK FOR ANYTHING IN QUOTES:
#############################
This goes well with my new article: amazing way to extract links from raw html
cat text.html | grep -o '['"'"'"][^"'"'"']*['"'"'"]'
The grep expression is really simple.
Look at this and open it up. Remember how quotes are evaluated in bash: when a quote is met, bash looks for the same type of quote to close it. Single quotes can only open and close each other, and the same goes for double quotes.
Space them apart, then look between each set of quotes
'[' "'" '"][^"' "'" ']*[' "'" '"]'
[ ' "][^" ' ]*[ ' "]
Bring that last line together, removing all of the spaces. So the final expression is
['"][^"']*['"]
That's what grep will operate on.
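One way to double-check the unquoting step is to let bash print the argument it would hand to grep. This is only a small sketch; the variable name `pattern` is made up for the illustration.

```shell
#!/usr/bin/env bash
# Let bash strip its own quoting and show what is left over for grep.
# "pattern" is an illustrative name, not part of the original one-liner.
pattern='['"'"'"][^"'"'"']*['"'"'"]'

# Prints the expression grep actually receives: ['"][^"']*['"]
printf '%s\n' "$pattern"
```

Printing the argument like this is a handy habit whenever single and double quotes are glued together on one command line.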
In grep that means:
['"] Look for ' or "
[^"']* Then look for a lot of characters that are not " or ' (quotes don't have quotes inside them, unless they are escaped, which this doesn't take into account)
['"] Then look for the ending quote (logically it will be the ending quote)
Note: the -o option in grep outputs only the part that matched (not the rest of the line).
Instead of -o, try --color: the whole line is kept and the matches are just colored red. Also, -o is the same as --only-matching.
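To see the expression at work, here is a small sketch that runs it over one made-up line instead of a real text.html. The sample text and the variable names are assumptions for the demo.

```shell
#!/usr/bin/env bash
# Made-up sample line, standing in for a line of text.html.
sample="<img src=\"cat.png\" alt='a cat'> plain text"

# -o prints every quoted string it finds on its own line.
quoted=$(printf '%s\n' "$sample" | grep -o '['"'"'"][^"'"'"']*['"'"'"]')

# Prints:
# "cat.png"
# 'a cat'
printf '%s\n' "$quoted"
```

Because both ends of the expression accept either kind of quote, a string that opens with ' and closes with " would match as well, so treat the output as a rough cut.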
Extract links out of html
##########################
wget -O - http://stackoverflow.com | grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' | sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'
cat text.html | grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' | sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//'
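Here is the same pipeline sketched against a couple of made-up HTML lines, so it can be tried without downloading anything. The HTML and the variable names are assumptions for the demo.

```shell
#!/usr/bin/env bash
# Made-up HTML, standing in for text.html or the wget output.
html=$(cat <<'EOF'
<p><a href="http://example.com/one">one</a> and <a href='/two'>two</a></p>
<a class="x" href="http://example.com/three">three</a>
EOF
)

# grep keeps each <a href="..."> piece, sed trims the prefix and the closing quote.
links=$(printf '%s\n' "$html" |
  grep -o '<a href=['"'"'"][^"'"'"']*['"'"'"]' |
  sed -e 's/^<a href=["'"'"']//' -e 's/["'"'"']$//')

# Prints:
# http://example.com/one
# /two
printf '%s\n' "$links"
```

The third link is skipped because its href does not come right after "<a ", which is a limit of matching on the literal text "<a href=".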
Extract links if you know a part of it
#######################################
This only works if your link/url is surrounded by single or double quotes
cat text.html | grep -o '['"'"'"][^"'"'"']*['"'"'"]' | grep "www.infotinks.com"
wget -O - http://www.infotinks.com | grep -o '['"'"'"][^"'"'"']*['"'"'"]' | grep "www.infotinks.com"
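One thing to keep in mind with the second grep: the dots in "www.infotinks.com" are regex wildcards, so they match any character. A possible tweak, sketched below on made-up input, is grep -F, which takes the search text literally. The sample strings and variable names are assumptions.

```shell
#!/usr/bin/env bash
# Made-up quoted strings, as the first grep -o might have produced them.
quoted_strings=$(printf '%s\n' \
  '"http://www.infotinks.com/page"' \
  '"http://wwwXinfotinksYcom/other"' \
  '"style.css"')

# Plain grep: "." means "any character", so the first two lines both match.
loose=$(printf '%s\n' "$quoted_strings" | grep "www.infotinks.com")

# -F (fixed strings): the dots are literal, so only the first line matches.
strict=$(printf '%s\n' "$quoted_strings" | grep -F "www.infotinks.com")

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

On real pages the loose version rarely bites, so this is more of a tidy-up than a fix.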
Another method
###############
Only works on links that begin with http or https. This mess sometimes grabs extra stuff.
cat text.html | grep --only-matching --perl-regexp "http(s?):\/\/[^ \"\(\)\<\>]*"
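Not every grep is built with --perl-regexp support, so here is a sketch of the same idea written for -E (extended regex) instead. This swaps in a different grep mode, and the sample text and variable names are made up. It also shows the "extra stuff" problem: the comma after the first URL comes along for the ride.

```shell
#!/usr/bin/env bash
# Made-up text with a few URLs in different surroundings.
text='Visit http://example.com/a?b=1, then (https://example.org/path) or <http://example.net>'

# Same character class as the Perl version: stop at space, ", (, ), < or >.
urls=$(printf '%s\n' "$text" | grep -oE "https?://[^ \"()<>]*")

# Prints:
# http://example.com/a?b=1,
# https://example.org/path
# http://example.net
printf '%s\n' "$urls"
```

Adding more characters to the bracket list (a comma, for instance) would trim more of the extras, at the risk of cutting URLs that really contain them.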