Hacking Restricted Webkit bug finder

dojafoja

life elevated
OP
Member
Joined
Jan 2, 2014
Messages
696
Trophies
1
XP
2,608
Country
New version here: http://www.mediafire.com/download/c1mvzc0fsoi55cf/wbf_v0.4.rar
This is only the regular Python script; I will build a Windows executable and update the OP tomorrow after work.

This is probably the last update I will be doing to this.

To anyone still using this with the file hosting feature, this is an important update.

Changes:

The old file hosting method only allowed dependencies within the same directory to be found; any outside it would fail. I also was not properly returning from the server thread until the main thread terminated, causing a buildup of threads until you exited the program. This fixes that. I also completely changed the way files are hosted and reworked it to serve from the root of LayoutTests, creating a single index.html that is rewritten each time with a JavaScript redirect to the proper file. This has found dependent files in all my tests.
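A minimal sketch of that index.html rewrite (the helper name and the HTML shape are illustrative, not taken from the actual script):

```python
import os

def write_redirect(layouttests_root, target_relpath):
    # Rewrite a single index.html in the LayoutTests root with a
    # JavaScript redirect to the test file currently being served.
    index_path = os.path.join(layouttests_root, "index.html")
    html = (
        "<html><head><script>"
        "window.location.replace('%s');"
        "</script></head><body></body></html>" % target_relpath
    )
    with open(index_path, "w") as f:
        f.write(html)
    return index_path
```

Serving everything from the LayoutTests root this way means relative dependency paths in a test resolve naturally, which is presumably why it fixed the out-of-directory failures.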

The log parser was modified to capture URLs more reliably. By doing so I found 25 new restricted bugs, which are included in the new database provided. Also, scanning for restricted bugs now happens in a separate thread so the UI doesn't become unresponsive. I also added some console output while scanning: every 50 attempts it prints the number of URLs left to scan and how many restricted bugs have been found so far.

You no longer need to manually strip an SVN log; it automatically stops parsing when it reaches 10/15/2012.

I have attached a .txt file below that contains a list of the 25 new bugs that were found. I haven't looked into any of them and only did a database comparison to find which ones were new. I forgot to include this in the rar file.
 

Attachments

  • new_bugs.txt
    200 bytes · Views: 300

Onion_Knight

Well-Known Member
Member
Joined
Feb 6, 2014
Messages
878
Trophies
0
Age
45
XP
997
Country
New version here: http://www.mediafire.com/download/c1mvzc0fsoi55cf/wbf_v0.4.rar
This is only the regular Python script; I will build a Windows executable and update the OP tomorrow after work.

This is probably the last update I will be doing to this.

To anyone still using this with the file hosting feature, this is an important update.

Changes:

The old file hosting method only allowed dependencies within the same directory to be found; any outside it would fail. I also was not properly returning from the server thread until the main thread terminated, causing a buildup of threads until you exited the program. This fixes that. I also completely changed the way files are hosted and reworked it to serve from the root of LayoutTests, creating a single index.html that is rewritten each time with a JavaScript redirect to the proper file. This has found dependent files in all my tests.

The log parser was modified to capture URLs more reliably. By doing so I found 25 new restricted bugs, which are included in the new database provided. Also, scanning for restricted bugs now happens in a separate thread so the UI doesn't become unresponsive. I also added some console output while scanning: every 50 attempts it prints the number of URLs left to scan and how many restricted bugs have been found so far.

You no longer need to manually strip an SVN log; it automatically stops parsing when it reaches 10/15/2012.

I have attached a .txt file below that contains a list of the 25 new bugs that were found. I haven't looked into any of them and only did a database comparison to find which ones were new. I forgot to include this in the rar file.


Are you using the changelog to parse the data? I've downloaded WebKit with SVN and just want to make sure I'm using the right logfile.
 

dojafoja

life elevated
OP
Member
Joined
Jan 2, 2014
Messages
696
Trophies
1
XP
2,608
Country
Are you using the changelog to parse the data? I've downloaded WebKit with SVN and just want to make sure I'm using the right logfile.
Yes, it's the changelog. I use Linux, but I obtained it by installing Subversion, navigating to your WebKit directory in a terminal, and running:
Code:
svn log > log.txt

Edit: it's the SVN commit log. Not sure if that's different from an official changelog.
 

Onion_Knight

Well-Known Member
Member
Joined
Feb 6, 2014
Messages
878
Trophies
0
Age
45
XP
997
Country
Yes, it's the changelog. I use Linux, but I obtained it by installing Subversion, navigating to your WebKit directory in a terminal, and running:
Code:
svn log > log.txt

Edit: it's the SVN commit log. Not sure if that's different from an official changelog.


Thanks, I figured it out. I changed your script a bit in the log parsing: it now grabs multiple URLs in the same revision. I also added some timers in your HTML parser to do some time approximation for downloads. Basically, it runs a timer for every 50 downloads and then uses that to estimate the time remaining. I'm looking for ways to speed it up, but it seems we're stuck with speed being based mostly on the download rate.
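The timing idea could be sketched like this; the function and names are illustrative, not taken from the actual parser:

```python
import time

def scan_with_eta(items, process, batch=50):
    # Process each item, and after every `batch` items use the average
    # rate so far to estimate how long the rest will take.
    start = time.time()
    total = len(items)
    for n, item in enumerate(items, 1):
        process(item)
        if n % batch == 0:
            elapsed = time.time() - start
            rate = elapsed / n                  # seconds per item so far
            remaining = rate * (total - n)
            print("%d left, ~%.1f s remaining" % (total - n, remaining))
```

Averaging over everything done so far smooths out bursty download speeds, at the cost of reacting slowly if the rate changes partway through.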
 

Attachments

  • new_test_parser.zip
    5.5 KB · Views: 87
  • Like
Reactions: dojafoja

dojafoja

life elevated
OP
Member
Joined
Jan 2, 2014
Messages
696
Trophies
1
XP
2,608
Country
Thanks, I figured it out. I changed your script a bit in the log parsing: it now grabs multiple URLs in the same revision. I also added some timers in your HTML parser to do some time approximation for downloads. Basically, it runs a timer for every 50 downloads and then uses that to estimate the time remaining. I'm looking for ways to speed it up, but it seems we're stuck with speed being based mostly on the download rate.
So I looked it over and I like what you did. I honestly have never used regular expressions and don't really understand them, but your code was easy to follow, so thank you. I learned everything I know by studying people's source code and googling. I took your version and integrated most of the changes from my newest version. As far as speeding up the scanning, I had an idea once but never implemented it: basically, write a threading daemon, divide all the URLs to scan among multiple threads, and run 5-10 threads at once. What do you think?

EDIT: Also, since I don't really understand re, could you have your parser stop parsing anything prior to 10/16/2012, similar to what I did in my newest v0.4 I posted, but using re?
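That threading idea might look something like this standard-library sketch, assuming a `check(url)` callable that returns True for restricted pages (not the script's real structure):

```python
from concurrent.futures import ThreadPoolExecutor

def scan_urls(urls, check, workers=10):
    # Run check(url) -> bool across a pool of worker threads and
    # return the URLs flagged as restricted, preserving input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(check, urls))
    return [u for u, restricted in zip(urls, flags) if restricted]
```

Since the work is network-bound, threads overlap the waiting on each request, which is why 5-10 workers can cut wall-clock time so sharply even under the GIL.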
 

Attachments

  • new_test_parser2.zip
    5.7 KB · Views: 82

dojafoja

life elevated
OP
Member
Joined
Jan 2, 2014
Messages
696
Trophies
1
XP
2,608
Country
Found a slight error in your code here:
Code:
for url in urllist:
    url = urllib2.urlopen(url)
    html = url.read()
    soup = BeautifulSoup(html)
    title_tag = soup.findAll("title")
    for i in title_tag:
        x = str(i)
        if 'Access Denied' in x:
            urls_found += 1
            print "Restriced URL Found"
            self.denied_urls.append(url)

You were appending the instance of urllib2.urlopen(url) to self.denied_urls instead of the URL string itself.

A quick fix would be to rename the instance in the for loop to something like url2, and then of course call url2.read().
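The fix, sketched with the network fetch factored out so the rename is easy to see; `fetch_html` is a hypothetical stand-in for urllib2.urlopen(url).read(), and this checks the whole page rather than just the title tags for brevity:

```python
def find_restricted(urllist, fetch_html, denied_urls):
    # fetch_html is a url -> html-string callable standing in for
    # urllib2.urlopen(url).read(), injected so this runs offline.
    urls_found = 0
    for url in urllist:
        html = fetch_html(url)       # the loop variable is no longer rebound
        if 'Access Denied' in html:
            urls_found += 1
            denied_urls.append(url)  # appends the URL string, not a response object
    return urls_found
```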
 
  • Like
Reactions: Damieh79

Onion_Knight

Well-Known Member
Member
Joined
Feb 6, 2014
Messages
878
Trophies
0
Age
45
XP
997
Country
Found a slight error in your code here:
Code:
for url in urllist:
    url = urllib2.urlopen(url)
    html = url.read()
    soup = BeautifulSoup(html)
    title_tag = soup.findAll("title")
    for i in title_tag:
        x = str(i)
        if 'Access Denied' in x:
            urls_found += 1
            print "Restriced URL Found"
            self.denied_urls.append(url)

You were appending the instance of urllib2.urlopen(url) to self.denied_urls instead of the URL string itself.

A quick fix would be to rename the instance in the for loop to something like url2, and then of course call url2.read().

Thanks,
That explains the error when it dropped out of the loop. It takes roughly 10 hours to run through the whole list. I was thinking about multi-threading and throwing 10 connections at it at a time. Thoughts?

EDIT: Saw the post above this just now and realized we're on the same page. Multithreading is probably the way to get the most efficiency out of it.
 

Onion_Knight

Well-Known Member
Member
Joined
Feb 6, 2014
Messages
878
Trophies
0
Age
45
XP
997
Country
So I looked it over and I like what you did. I honestly have never used regular expressions and don't really understand them, but your code was easy to follow, so thank you. I learned everything I know by studying people's source code and googling. I took your version and integrated most of the changes from my newest version. As far as speeding up the scanning, I had an idea once but never implemented it: basically, write a threading daemon, divide all the URLs to scan among multiple threads, and run 5-10 threads at once. What do you think?

EDIT: Also, since I don't really understand re, could you have your parser stop parsing anything prior to 10/16/2012, similar to what I did in my newest v0.4 I posted, but using re?


Python regular expressions aren't really conventional anyway; the syntax itself works, but the structure is different from your typical Perl-like regex.

I'll add in some basic definitions about what they're looking for, and definitely have it stop prior to that date.
 

Onion_Knight

Well-Known Member
Member
Joined
Feb 6, 2014
Messages
878
Trophies
0
Age
45
XP
997
Country
So I looked it over and I like what you did. I honestly have never used regular expressions and don't really understand them, but your code was easy to follow, so thank you. I learned everything I know by studying people's source code and googling. I took your version and integrated most of the changes from my newest version. As far as speeding up the scanning, I had an idea once but never implemented it: basically, write a threading daemon, divide all the URLs to scan among multiple threads, and run 5-10 threads at once. What do you think?

EDIT: Also, since I don't really understand re, could you have your parser stop parsing anything prior to 10/16/2012, similar to what I did in my newest v0.4 I posted, but using re?


Added re code to match on dates. If you want to add a configurable text box on the GUI, you could allow user-specified end dates; just check to make sure they meet the "YYYY-MM-DD" format. It also now supports multi-threading and spawns 10 threads on each pass. Any more and I started getting SSL errors. It reduced the 10 hours down to 2 for 32,000+ bugs, which is a significant improvement.
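Validating a user-supplied end date against that format could be as simple as this sketch (the names are illustrative):

```python
import re

# Anchored pattern for the YYYY-MM-DD shape the parser's stop date uses.
DATE_RE = re.compile(r'^\d{4}-\d{2}-\d{2}$')

def valid_end_date(text):
    # True if the user-entered text matches YYYY-MM-DD.
    return bool(DATE_RE.match(text))
```

Note this only checks the shape, not calendar validity; a real GUI check might additionally round-trip through datetime.strptime to reject dates like 2012-13-99.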
 

Attachments

  • new_test_parser2.zip
    6.4 KB · Views: 86

dojafoja

life elevated
OP
Member
Joined
Jan 2, 2014
Messages
696
Trophies
1
XP
2,608
Country
Added re code to match on dates. If you want to add a configurable text box on the GUI, you could allow user-specified end dates; just check to make sure they meet the "YYYY-MM-DD" format. It also now supports multi-threading and spawns 10 threads on each pass. Any more and I started getting SSL errors. It reduced the 10 hours down to 2 for 32,000+ bugs, which is a significant improvement.

Man, that was fast! Thanks for the detailed explanation in the code of what's going on with re. After reading your comments I could follow it, but it still seems a bit wild, lol. You cranked that out and brought the scan time to 1/5 of what it was; that's great! I will definitely allow a user-supplied end date on the GUI end. I haven't tested anything yet, but the code looks awesome. Thanks again so much. I don't have a ton of time right now because of work and an Android development project I'm doing using Kivy for Python, which is pretty cool, by the way, for quickly cranking out Android apps. There's even an APK builder called Buildozer.
 

Onion_Knight

Well-Known Member
Member
Joined
Feb 6, 2014
Messages
878
Trophies
0
Age
45
XP
997
Country
Man, that was fast! Thanks for the detailed explanation in the code of what's going on with re. After reading your comments I could follow it, but it still seems a bit wild, lol. You cranked that out and brought the scan time to 1/5 of what it was; that's great! I will definitely allow a user-supplied end date on the GUI end. I haven't tested anything yet, but the code looks awesome. Thanks again so much. I don't have a ton of time right now because of work and an Android development project I'm doing using Kivy for Python, which is pretty cool, by the way, for quickly cranking out Android apps. There's even an APK builder called Buildozer.


Thanks,

I'm still working through some of the code and cleaning my stuff up. I'll try to get a final product out tonight so you have as much time to tweak it as possible. I'll add more comments throughout to clarify what I'm doing. Basically, my goal is to get this down to around an hour and a half for a full scan.
 
  • Like
Reactions: dojafoja

dojafoja

life elevated
OP
Member
Joined
Jan 2, 2014
Messages
696
Trophies
1
XP
2,608
Country
Thanks,

I'm still working through some of the code and cleaning my stuff up. I'll try to get a final product out tonight so you have as much time to tweak it as possible. I'll add more comments throughout to clarify what I'm doing. Basically, my goal is to get this down to around an hour and a half for a full scan.
You are a bada**; do whatever you want. A contribution like that is huge!
 

dojafoja

life elevated
OP
Member
Joined
Jan 2, 2014
Messages
696
Trophies
1
XP
2,608
Country
I hate to keep doing this :P but I found a little change here that causes the database not to be created if one doesn't already exist. I thought I would make you aware in case you haven't picked up on it already. Everything else seems to rock so far!

Code:
if os.path.isfile('commits.db'): # There cannot be an existing file named 'commits.db if you plan to parse a new log.'
            #message.showerror('error','A file named commits.db already exists, please rename or move your old database file.')
            #raise Exception
            os.remove('commits.db')
        #else:
            bugs = ''
            rvn = ''
            url = []
            stop_date = "2012-10-16"
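A corrected shape for that block might look like this sketch; the surrounding class context and the rest of the parse state are omitted:

```python
import os

def prepare_db(path='commits.db'):
    # Remove any stale database so a fresh parse always starts clean...
    if os.path.isfile(path):
        os.remove(path)
    # ...and initialize the parse state unconditionally, whether or not
    # a file existed (in the buggy version this only ran inside the if).
    bugs = ''
    rvn = ''
    url = []
    stop_date = "2012-10-16"
    return bugs, rvn, url, stop_date
```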
 

Onion_Knight

Well-Known Member
Member
Joined
Feb 6, 2014
Messages
878
Trophies
0
Age
45
XP
997
Country
I hate to keep doing this :P but I found a little change here that causes the database not to be created if one doesn't already exist. I thought I would make you aware in case you haven't picked up on it already. Everything else seems to rock so far!

Code:
if os.path.isfile('commits.db'): # There cannot be an existing file named 'commits.db if you plan to parse a new log.'
            #message.showerror('error','A file named commits.db already exists, please rename or move your old database file.')
            #raise Exception
            os.remove('commits.db')
        #else:
            bugs = ''
            rvn = ''
            url = []
            stop_date = "2012-10-16"

Yeah, I had it commented out when I was tweaking it. Right now, I am having trouble with:

Code:
db.commit()
db.close()
root.html_thread = False
message.showinfo(title="Complete", message="Scanning for restricted bugs is complete")

Both root.html_thread and message.showinfo are hanging.
 
  • Like
Reactions: Damieh79

Onion_Knight

Well-Known Member
Member
Joined
Feb 6, 2014
Messages
878
Trophies
0
Age
45
XP
997
Country
Yeah, I had it commented out when I was tweaking it. Right now, I am having trouble with:

Code:
db.commit()
db.close()
root.html_thread = False
message.showinfo(title="Complete", message="Scanning for restricted bugs is complete")

Both root.html_thread and message.showinfo are hanging.

dojafoja
I've got it throttling pretty high now, but it hits those statements and hangs. If they are commented out and I go to the 2nd tab, I get errors. I can't figure out what those statements are doing that causes a hang.

Attached is the current build. I've got test settings on it to limit the checks down to verify everything works. Take a look and adjust your settings to find your best setup. Maybe another pair of eyes on it will help.
 

Attachments

  • new_test_parser2.zip
    6.6 KB · Views: 78

dojafoja

life elevated
OP
Member
Joined
Jan 2, 2014
Messages
696
Trophies
1
XP
2,608
Country
dojafoja
I've got it throttling pretty high now, but it hits those statements and hangs. If they are commented out and I go to the 2nd tab, I get errors. I can't figure out what those statements are doing that causes a hang.

Attached is the current build. I've got test settings on it to limit the checks down to verify everything works. Take a look and adjust your settings to find your best setup. Maybe another pair of eyes on it will help.
Maybe in the morning I will have time to really go over it; I think my wife has had enough of me being on the laptop this week :P.

Basically, all that root.html_thread = False is doing is resetting a value that is checked when the user clicks the scan button. It was to prevent multiple scans from starting if the user clicked the scan button multiple times without letting the previous scan complete.

About the Tkinter messagebox: I had to completely remove that part in my v0.4 version because Tkinter is not thread safe. Once I started putting things in separate threads, it would always hang when an external thread tried to generate a Tk messagebox; only the thread in which Tk was instantiated can call these. I tried everything I could think of. I had the external threads generate a virtual event, bound the Tk instance to the virtual event, and had the messagebox called from my Root class, and even this would hang. I tried having the external thread put the messagebox call into a queue using Python's Queue module and then having the main thread periodically check the queue and call the messagebox, but that would hang too. IDK?

I did a dirty little hack in my file hosting thread to get a messagebox when the index.html was successfully created: in the main thread, when the button was clicked, I ran a while loop waiting for the external thread to change a particular value, at which point I would break the loop and call the messagebox. It's a dirty hack, but it sort of works.
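For what it's worth, the queue approach usually works when the main thread polls with a non-blocking get (scheduled via root.after in Tk) rather than a blocking one. A minimal sketch of that pattern, with the Tk mainloop replaced by a plain polling loop so it runs anywhere:

```python
import queue
import threading
import time

def worker(q):
    # Simulated background scan; when done, hand a message to the main thread.
    time.sleep(0.1)
    q.put("Scanning for restricted bugs is complete")

def poll_for_message(q, timeout=5.0):
    # Poll with a non-blocking get, the way a root.after() callback would;
    # only the main (Tk) thread ever touches the GUI.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            return q.get_nowait()   # in a real GUI: messagebox.showinfo(...)
        except queue.Empty:
            time.sleep(0.05)        # in Tk: root.after(50, poll) instead of sleeping
    return None
```

The key detail is that the main thread never blocks on the queue and the worker never calls Tk; if the main thread blocks (or a worker touches a widget), you get exactly the hangs described above.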
 

Onion_Knight

Well-Known Member
Member
Joined
Feb 6, 2014
Messages
878
Trophies
0
Age
45
XP
997
Country
Maybe in the morning I will have time to really go over it; I think my wife has had enough of me being on the laptop this week :P.

Basically, all that root.html_thread = False is doing is resetting a value that is checked when the user clicks the scan button. It was to prevent multiple scans from starting if the user clicked the scan button multiple times without letting the previous scan complete.

About the Tkinter messagebox: I had to completely remove that part in my v0.4 version because Tkinter is not thread safe. Once I started putting things in separate threads, it would always hang when an external thread tried to generate a Tk messagebox; only the thread in which Tk was instantiated can call these. I tried everything I could think of. I had the external threads generate a virtual event, bound the Tk instance to the virtual event, and had the messagebox called from my Root class, and even this would hang. I tried having the external thread put the messagebox call into a queue using Python's Queue module and then having the main thread periodically check the queue and call the messagebox, but that would hang too. IDK?

I did a dirty little hack in my file hosting thread to get a messagebox when the index.html was successfully created: in the main thread, when the button was clicked, I ran a while loop waiting for the external thread to change a particular value, at which point I would break the loop and call the messagebox. It's a dirty hack, but it sort of works.

I solved the first one and commented out the messagebox. Now I'm on to the next thing: log and HTML parsing are good, but my changes modified the database table, so I just have to rewrite the querying.

My wife just rolls her eyes, but doesn't say anything. She knows that I'll be dreaming code all night anyway.
 
  • Like
Reactions: dojafoja

dojafoja

life elevated
OP
Member
Joined
Jan 2, 2014
Messages
696
Trophies
1
XP
2,608
Country
Does root.html_thread = False really hang by itself, without the messagebox, on your machine? Also, using print like you did generated the messagebox on my machine.
 
