Regex to find external links from the html file using grep

Posted by Amar on Stack Overflow See other posts from Stack Overflow or by Amar
Published on 2010-06-09T12:28:27Z Indexed on 2010/06/09 12:32 UTC
Read the original article Hit count: 575

Filed under:

regex

|

linux

|

grep

hello,

From past few days I'm trying to develop a regex that fetch all the external links from the web pages given to it using grep.

Here is my grep command

grep -h -o -e "\(\(mailto:\|\(\(ht\|f\)tp\(s\?\)\)\)\://\)\{1\}\(.*\?\)" "/mnt/websites_folder/folder_to_search" -r

now the grep seem to return everything after the external links in that given line

Example

if an html file contain something like this on same line

Google

Yahoo

then the given grep command return the following result

http://www.google.com">Google</a><p><a href='https://yahoo.com'>Yahoo</a></p>


the idea here is that if an html file contain more than one  links(`irrespective in a,img etc`) in same line then the regex should fetch only the links and not all content of that line

I managed to developed the same in rubular.com the regex is as follow

("|')(\b((ht|f)tps?:\/\/)(.*?)\b)("|')

with work with the above input but iam not able to replicate the same in grep can anyone help I can't modify the html file so don't ask me to do that neither I can look for each specific tags and check their attributes to to get external links as it addup processing time and my application doesn't demand that

Thank You

© Stack Overflow or respective owner

Related posts about regex

Find multiple regex in each line and skip result if one of the regex doesn't match

as seen on Stack Overflow - Search for 'Stack Overflow'
I have a list of variables: variables = ['VariableA', 'VariableB','VariableC'] which I'm going to search for, line by line ifile = open("temp.txt",'r') d = {} match = zeros(len(variables)) for line in ifile: emptyCells=0 for i in range(len(variables)): regex = r'('+variables[i]+r')[:|=|\(](-… >>> More
OWASP Regex Repository: Is this regex correct?

as seen on Stack Overflow - Search for 'Stack Overflow'
I was looking at the regular expression for validating various data types from the (OWASP Regex Repository). One of the regular expressions in there is called safetext and looks like: ^[a-zA-Z0-9\s.\-]+$ My first question is: Is this regular expression correct? complementary question If this… >>> More
Make a Perl-style regex interpreter behave like a basic or extended regex interpreter

as seen on Stack Overflow - Search for 'Stack Overflow'
I am writing a tool to help students learn regular expressions. I will probably be writing it in Java. The idea is this: the student types in a regular expression and the tool shows which parts of a text will get matched by the regex. Simple enough. But I want to support several different regex… >>> More
JS regex isn't matching, even thought it works with a regex tester

as seen on Stack Overflow - Search for 'Stack Overflow'
I'm writing a piece of client-side javascript code that takes a function and finds the derivative of it, however, the regex that's supposed to match with the power rule fails to work in the context of the javascript program, even though it sucessfully matches when it's used with an independent regex… >>> More
c# RegEx with "|"

as seen on Stack Overflow - Search for 'Stack Overflow'
I need to be able to check for a pattern with | in them. For example an expression like d*|*t should return true for a string like "dtest|test". I'm no regex hero so I just tried a couple of things, like: Regex Pattern = new Regex("s*\|*d"); //unable to build because of single backslash Regex Pattern… >>> More

Related posts about linux

apt-get install and update fail

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
I've got a problem with apt-get update and apt-get install ... commands . every time update or installing fails and errors are : Get:1 http://dl.google.com stable Release.gpg [198B] Ign http://dl.google.com/linux/chrome/deb/ stable/main Translation-en_US Get:2 http://dl… >>> More
kernel module compiling error

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
sh@ubuntu:/home/ccpp/helloworld$ make gcc-4.6 -O2 -DMODULE -D_KERNEL_ -W -Wall -Wstrict-prototypes -Wmissing-prototypes -isystem /lib/modules/`uname -r`/build/include -c -o hello-1.o hello-1.c hello-1.c:4:0: warning: "MODULE" redefined [enabled by default] <command-line>:0:0: note: this is… >>> More
Build-Essentials installation failing

as seen on Ask Ubuntu - Search for 'Ask Ubuntu'
I am having trouble accessing the several critical header files that show to be a part of the build process. The "Ubuntu Software Center" shows "Build Essentials" as installed: Next I did the following two commands, which did not improve the problem: ~$ sudo apt-get install build-essential [sudo]… >>> More
Updating Debian kernel

as seen on Super User - Search for 'Super User'
I'm trying to update my Debian machine to 2.6.32-46 (which is the new stable). However, after doing apt-get update my apt-cache search linux-image shows me: linux-headers-2.6.32-5-486 - Header files for Linux 2.6.32-5-486 linux-headers-2.6.32-5-686-bigmem - Header files for Linux 2.6.32-5-686-bigmem linux-headers-2… >>> More
Serial connection over a single USB cable (Windows to linux, or linux to linux)

as seen on Server Fault - Search for 'Server Fault'
I'm helping out with a project for an embedded device that only has USB and no serial. This device is running Linux. These days, when we need to connect to a serial port on a device we typically use a USB to serial adapter (on something like a phone system or a load balancing device, etc). I would… >>> More