IRC Logs for #circuits Monday, 2014-04-28

prologicI dunno :)00:00
prologicone of you buggers  :) lol00:00
prologicit has a builtin “die” command00:00
prologicdo you think I should build an auth and authorization plugin(s)?00:01
prologicit’s not like you can really cause any damage to it in any way :)00:01
prologichttps://pypi.python.org/pypi/tartpy00:24
prologicinteresting stuff00:24
prologicthis is the 2nd actor based concurrency model I’ve seen in Python00:24
robert_hai00:25
prologichi00:30
c45ywasn't me01:43
prologicsure it wasn’t :)01:43
prologickdb_: load drone01:43
b69904af95c0Loaded plugin: Drone01:43
kdbUnknown Command: _:01:43
prologicsomeone remind me to fix that01:45
c45ywhy are there 2 bots?01:45
prologicone runs on tutum.co01:47
prologicas a test of their infrastructure01:47
prologicthe other runs on my desktop at home01:48
c45yi see01:48
prologicas a test of home brewed infrastructure01:48
Romsterprologic, fixed orc02:42
Romsterafter line 157 if patterns.... verbose and log("  (P): {0}", _url)02:43
Romsteri added two lines to spyda02:43
Romsterif links == '':02:43
Romster                        links = queue.popleft()02:43
Romsterreasoning is when --patterns is used it will only crawl urls that match the pattern and not try to go to other sites.02:44
Romsterit somewhat works better but not ideal yet. while i do expect the first pass to see a bunch of (P) lines then after the first index page is done it still seems to follow onto other regexs not in --pattern02:45
Romsteri'm trying to make --pattern only follow urls i am interested in.02:45
Romster--pattern="^.*sourceforge\.net\/projects\/xine\/files\/.*$"02:46
Romsterfor example02:46
prologicactually this is incorrect02:48
prologic-p/--pattern is merely a grep like mechanism02:48
prologicfor “displaying”/“outputting” the urls of interest02:48
prologicsaves you from doing:02:48
prologiccrawl … | grep pattern02:48
prologicit does not affect the URI(s) the crawler follows02:49
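
A minimal sketch of the distinction prologic draws here, assuming the crawler has already produced a list of visited URLs. The function name `display_urls` and the sample URLs are hypothetical and are not taken from spyda; the only point shown is that a grep-like pattern filters what is printed and leaves the crawl itself untouched.

```python
import re


def display_urls(crawled_urls, pattern=None):
    """Return the URLs worth printing; the crawl list is not modified.

    Hypothetical helper: mimics `crawl ... | grep pattern` by filtering
    output only. It has no say in which links get followed.
    """
    if pattern is None:
        return list(crawled_urls)
    regex = re.compile(pattern)
    # re.search, like grep, matches anywhere in the string.
    return [url for url in crawled_urls if regex.search(url)]


# Illustrative data: both URLs were crawled, only one is displayed.
crawled = [
    "http://sourceforge.net/projects/xine/files/xine-lib/",
    "https://twitter.com/sourceforge",
]
shown = display_urls(crawled, r"^.*sourceforge\.net/projects/xine/files/.*$")
```
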
Romsterok what i want to do is restrict what patterns i search in urls.02:49
prologicAlso links IIRC will be a list02:49
prologicnot a str02:49
Romstermaybe it's best handled by another option?02:49
prologicYeah that’s -p/--pattern02:49
prologicHmm02:49
prologicwhat am I not being clear on?02:49
Romsterbut before i added those two lines it still went off to twitter and other sites.02:49
prologicif you only want to follow a certain pattern of urls02:50
prologicthen02:50
Romstercorrect02:50
prologic--blacklist=".*" --whitelist="some_uri_pattern"02:50
Romsterdid that it still goes to other sites.02:50
prologicit most definitely should not02:50
prologicthat’s a bug if it followed blacklisted sites02:51
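
One way to picture the blacklist/whitelist combination suggested above is the sketch below. It assumes that a whitelist match overrides a blacklist match, which is the only reading under which `--blacklist=".*"` plus a whitelist follows anything at all; the function `should_follow` is hypothetical and is not spyda's actual implementation.

```python
import re


def should_follow(url, blacklist, whitelist):
    """Decide whether a crawler should follow `url`.

    Assumption (not confirmed against spyda's source): a whitelist match
    wins over a blacklist match; otherwise any blacklist match rejects.
    """
    if any(re.match(pattern, url) for pattern in whitelist):
        return True
    return not any(re.match(pattern, url) for pattern in blacklist)


blacklist = [r".*"]
whitelist = [r"^http://sourceforge\.net/projects/xine/files/.*$"]

inside = should_follow(
    "http://sourceforge.net/projects/xine/files/xine-lib/", blacklist, whitelist
)
outside = should_follow("https://twitter.com/sourceforge", blacklist, whitelist)
```

With these rules, only URLs under the whitelisted prefix are followed, so a crawl that still reaches twitter would point to a bug or to a different precedence rule than the one assumed here.
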
Romsterok so it should be removing urls not matching whitelist.02:51
Romsterwell i'm pretty sure it is.02:51
prologicplease confirm that this is a bug or not02:51
prologicspyda actually has 100% test coverage02:51
prologicso blacklisting and whitelisting is fully tested02:51
Romsterok it might be me doing something stupid too02:51
Romstercrawl -v --content-type --blacklist=".*" --whitelist="^http\:\/\/sourceforge\.net\/projects\/xine\/files\/.*$" --pattern="^.*sourceforge\.net\/projects\/xine\/files\/.*$" http://sourceforge.net/projects/xine/files/xine-lib/ |tee /tmp/sclist02:51
Romsteris my usage02:51
prologicremove --content-type02:52
prologicother than that, should be fine02:52
Romsteri think content type doesn't handle this either02:52
RomsterContent-Type: text/html; charset=utf-802:52
prologicit should only follow URI(s) that start with http://sourceforge.net/projects/xine/files/02:53
Romster; delimiter with another option.02:53
Romsterhmm k02:53
prologicyeah I may have to expand content-type restrictions to allow for charsets02:53
prologicin this case I’d just ignore any charset02:53
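
The charset problem discussed above can be sketched as follows: compare only the media type and drop any parameters after the `;`. The helper names `media_type` and `is_allowed` are hypothetical, and this is one possible approach rather than how spyda handles `--content-type`.

```python
def media_type(header_value):
    """Return the bare media type from a Content-Type header value.

    Parameters such as `charset=utf-8` follow a `;` and are ignored.
    """
    return header_value.split(";", 1)[0].strip().lower()


def is_allowed(header_value, allowed_types):
    """True if the header's media type is in `allowed_types` (hypothetical)."""
    return media_type(header_value) in allowed_types


html_ok = is_allowed("text/html; charset=utf-8", {"text/html", "text/xml"})
tarball_ok = is_allowed("application/x-gzip", {"text/html", "text/xml"})
```
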
Romsteryeah as we are not interested in knowing that.02:54
Romsterit is working better than it was before though02:54
prologicgood02:55
prologicyeah please test the whitelist/blacklist stuff02:55
prologicit *should* work02:55
Romsterthe logic around all that area of the crawler is a little confusing02:55
Romsterbut i get the gist of how it works.02:56
Romsterah yeah whitelist urls to index makes more sense.02:56
Romsterhmm removing content type actually is an improvement.02:56
Romsterok that is working better but no actual file names are shown in my output02:57
Romsterhttps://gist.github.com/1136073102:58
Romstercrawl -v --blacklist=".*" --whitelist="^http\:\/\/sourceforge\.net\/projects\/xine\/files\/.*$" --pattern="^.*sourceforge\.net\/projects\/xine\/files\/.*$" http://sourceforge.net/projects/xine/files/xine-lib/ |tee /tmp/sclist02:58
Romsteri should be seeing a bunch of files in that list.02:58
ircnotifier23d1ccd93dd8 by prologic: Cleaned up list comprehensions02:59
Romsterbut that is working --content-type was messing it up02:59
prologicyeah03:00
Romstercontent type may be omitting those files in the output.03:00
Romsterbut all i wanted the content type for was to prevent it downloading non html/xml files in the crawler stage.03:01
Romsteri gotta get going i'll be on later03:01
prologickk03:01
*** ninkotech has quit IRC03:24
*** ninkotech has joined #circuits03:24
kdbHello ninkotech03:24
b69904af95c0Yo ninkotech03:24
Romsterprologic, ok i'm back07:45
ninkotech:) hi kdb08:32
Romsterwith that many bots in here it's starting to be hard to tell who's a bot and who's not.08:42
epicmuffin<- not a bot09:21
epicmuffinwhatever Romster wants to do, here is json: http://sourceforge.net/projects/xine/files/xine-lib/list09:24
Romsterwas not aware of that but not all sites have that09:25
*** Osso has joined #circuits09:26
b69904af95c0Welcome back osso :)09:26
kdbWelcome back osso :)09:26
epicmuffinyes the path changes a bit: http://sourceforge.net/projects/openofficeorg.mirror/files//list09:27
Romsterthat's not good. why can't they be consistent.09:28
epicmuffinthey are :)09:28
epicmuffinhttp://sourceforge.net/projects/openofficeorg.mirror/files//list09:28
epicmuffinhttp://sourceforge.net/projects/xine/files//list09:28
epicmuffinproject/files/list09:28
epicmuffinor09:28
epicmuffinproject/files/folder/list09:28
epicmuffindepending on what you want09:29
Romsterthat's fine for sourceforge but doesn't cover other sites09:29
Romsteri could probably use a upper level tool to check for a json file and optionally use that.09:30
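
A rough sketch of the idea Romster describes, using the `project/files/folder/list` layout epicmuffin posted. The helpers `listing_url` and `parse_listing` are hypothetical, fetching is left out, and nothing here assumes the endpoint exists on other sites; a caller would fall back to normal crawling when `parse_listing` returns `None`.

```python
import json


def listing_url(files_url):
    """Build the JSON listing URL for a files folder by appending `list`."""
    return files_url.rstrip("/") + "/list"


def parse_listing(body):
    """Return the decoded JSON listing, or None if `body` is not JSON.

    A None result signals the caller to fall back to crawling the HTML.
    """
    try:
        return json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


url = listing_url("http://sourceforge.net/projects/xine/files/xine-lib/")
```
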
*** ninkotech has quit IRC09:31
*** ninkotech has joined #circuits09:36
kdbWelcome back ninkotech :)09:36
b69904af95c0Welcome back ninkotech :)09:36
ninkotech;)09:36
ninkotech(crashed x.org, grr)09:37
*** Osso has quit IRC09:43
Romsteralways fun11:12
Romsterepicmuffin, thanks for that json url that's something i can add to my lsurl program.11:12
epicmuffin:)11:37
epicmuffinopened sf.net and firebug and looked at the console if there is any ajax going on11:37
Romsteri never even thought of looking at ajax11:59
*** FSX has joined #circuits12:14
kdbWelcome back fsx :)12:14
b69904af95c0Hello fsx12:14
prologichiya all13:05
prologicI'm available for about ~10mins13:05
prologicif any circuits q's I'm here for them before going to bed :)13:05
prologicok13:21
prologicnight all13:21
*** robert___ has joined #circuits19:40
*** robert___ has quit IRC19:40
*** robert___ has joined #circuits19:40
kdbHello robert___19:40
b69904af95c0Yo robert___19:40
*** ninkotech has quit IRC23:15
*** ninkotech has joined #circuits23:15
kdbWelcome back ninkotech :)23:15
b69904af95c0Welcome back ninkotech :)23:15

Generated by irclog2html.py 2.11.0 by Marius Gedminas - find it at mg.pov.lt!