Nick | Message | Time |
---|---|---|
prologic | I dunno :) | 00:00 |
prologic | one of you buggers :) lol | 00:00 |
prologic | it has a builtin “die” command | 00:00 |
prologic | do you think I should build an auth and authorization plugin(s)? | 00:01 |
prologic | it’s not like you can really cause any damage to it in any way :) | 00:01 |
prologic | https://pypi.python.org/pypi/tartpy | 00:24 |
prologic | interesting stuff | 00:24 |
prologic | this is the 2nd actor based concurrency model I’ve seen in Python | 00:24 |
robert_ | hai | 00:25 |
prologic | hi | 00:30 |
c45y | wasn't me | 01:43 |
prologic | sure it wasn’t :) | 01:43 |
prologic | kdb_: load drone | 01:43 |
b69904af95c0 | Loaded plugin: Drone | 01:43 |
kdb | Unknown Command: _: | 01:43 |
prologic | someone remind me to fix that | 01:45 |
c45y | why are there 2 bots? | 01:45 |
prologic | one runs on tutum.co | 01:47 |
prologic | as a test of their infrastructure | 01:47 |
prologic | the other runs on my desktop at home | 01:48 |
c45y | i see | 01:48 |
prologic | as a test of home brewed infrastructure | 01:48 |
Romster | prologic, fixed orc | 02:42 |
Romster | after line 157 if patterns.... verbose and log(" (P): {0}", _url) | 02:43 |
Romster | i added two lines to spyda | 02:43 |
Romster | if links == '': | 02:43 |
Romster | links = queue.popleft() | 02:43 |
Romster | reasoning is when --pattern is used it will only crawl urls that match the pattern and not try to go to other sites. | 02:44 |
Romster | it works somewhat better but is not ideal yet. i do expect the first pass to show a bunch of (P) lines, but after the first index page is done it still seems to follow urls that don't match the regexes in --pattern | 02:45 |
Romster | i'm trying to make --pattern only follow urls i am interested in. | 02:45 |
Romster | --pattern="^.*sourceforge\.net\/projects\/xine\/files\/.*$" | 02:46 |
Romster | for example | 02:46 |
prologic | actually this is incorrect | 02:48 |
prologic | -p/--pattern is merely a grep like mechanism | 02:48 |
prologic | for “displaying”/“outputting” the urls of interest | 02:48 |
prologic | saves you from doing: | 02:48 |
prologic | crawl … \| grep pattern | 02:48 |
prologic | it does not affect the URI(s) the crawler follows | 02:49 |
Romster | ok what i want to do is restrict what patterns i search in urls. | 02:49 |
prologic | Also links IIRC will be a list | 02:49 |
prologic | not a str | 02:49 |
Romster | maybe it's best handled by another option? | 02:49 |
prologic | Yeah that’s -p/--pattern | 02:49 |
prologic | Hmm | 02:49 |
prologic | what am I not being clear on? | 02:49 |
Romster | but before i added those two lines it still went off to twitter and other sites. | 02:49 |
prologic | if you only want to follow a certain pattern of urls | 02:50 |
prologic | then | 02:50 |
Romster | correct | 02:50 |
prologic | --blacklist=".*" --whitelist="some_uri_pattern" | 02:50 |
Romster | did that it still goes to other sites. | 02:50 |
prologic | it most definitely should not | 02:50 |
prologic | that’s a bug if it followed blacklisted sites | 02:51 |
Romster | ok so it should be removing urls not matching whitelist. | 02:51 |
Romster | well i'm pretty sure it is. | 02:51 |
prologic | please confirm that this is a bug or not | 02:51 |
prologic | spyda actually has 100% test coverage | 02:51 |
prologic | so blacklisting and whitelisting is fully tested | 02:51 |
Romster | ok it might be me doing something stupid too | 02:51 |
Romster | crawl -v --content-type --blacklist=".*" --whitelist="^http\:\/\/sourceforge\.net\/projects\/xine\/files\/.*$" --pattern="^.*sourceforge\.net\/projects\/xine\/files\/.*$" http://sourceforge.net/projects/xine/files/xine-lib/ \|tee /tmp/sclist | 02:51 |
Romster | is my usage | 02:51 |
prologic | remove --content-type | 02:52 |
prologic | other than that, should be fine | 02:52 |
Romster | i think content type doesn't handle this either | 02:52 |
Romster | Content-Type: text/html; charset=utf-8 | 02:52 |
prologic | it should only follow URI(s) that start with http://sourceforge.net/projects/xine/files/ | 02:53 |
Romster | ; delimiter with another option. | 02:53 |
Romster | hmm k | 02:53 |
prologic | yeah I may have to expand content-type restrictions to allow for charsets | 02:53 |
prologic | in this case I’d just ignore any charset | 02:53 |
Romster | yeah as we are not interested in knowing that. | 02:54 |
Romster | it is working better than it was before though | 02:54 |
prologic | good | 02:55 |
prologic | yeah please test the whitelist/blacklist stuff | 02:55 |
prologic | it *should* work | 02:55 |
Romster | the logic around all that area of the crawler is a little confusing | 02:55 |
Romster | but i get the gist of how it works. | 02:56 |
Romster | ah yeah whitelist urls to index makes more sense. | 02:56 |
Romster | hmm removing content type actually is an improvement. | 02:56 |
Romster | ok that is working better but no actual file names are shown in my output | 02:57 |
Romster | https://gist.github.com/11360731 | 02:58 |
Romster | crawl -v --blacklist=".*" --whitelist="^http\:\/\/sourceforge\.net\/projects\/xine\/files\/.*$" --pattern="^.*sourceforge\.net\/projects\/xine\/files\/.*$" http://sourceforge.net/projects/xine/files/xine-lib/ \|tee /tmp/sclist | 02:58 |
Romster | i should be seeing a bunch of files in that list. | 02:58 |
ircnotifier | 23d1ccd93dd8 by prologic: Cleaned up list comprehensions | 02:59 |
Romster | but that is working --content-type was messing it up | 02:59 |
prologic | yeah | 03:00 |
Romster | content type may be omitting them files in the output. | 03:00 |
Romster | but all i wanted the content type for was to prevent it downloading non html/xml files in the crawler stage. | 03:01 |
Romster | i gotta get going i'll be on later | 03:01 |
prologic | kk | 03:01 |
*** ninkotech has quit IRC | 03:24 | |
*** ninkotech has joined #circuits | 03:24 | |
kdb | Hello ninkotech | 03:24 |
b69904af95c0 | Yo ninkotech | 03:24 |
Romster | prologic, ok i'm back | 07:45 |
ninkotech | :) hi kdb | 08:32 |
Romster | with that many bots in here it's starting to be hard to tell who's a bot and who's not. | 08:42 |
epicmuffin | <- not a bot | 09:21 |
epicmuffin | whatever Romster wants to do, here is json: http://sourceforge.net/projects/xine/files/xine-lib/list | 09:24 |
Romster | was not aware of that but not all sites have that | 09:25 |
*** Osso has joined #circuits | 09:26 | |
b69904af95c0 | Welcome back osso :) | 09:26 |
kdb | Welcome back osso :) | 09:26 |
epicmuffin | yes the path changes a bit: http://sourceforge.net/projects/openofficeorg.mirror/files//list | 09:27 |
Romster | that's not good. why can't they be consistent. | 09:28 |
epicmuffin | they are :) | 09:28 |
epicmuffin | http://sourceforge.net/projects/openofficeorg.mirror/files//list | 09:28 |
epicmuffin | http://sourceforge.net/projects/xine/files//list | 09:28 |
epicmuffin | project/files/list | 09:28 |
epicmuffin | or | 09:28 |
epicmuffin | project/files/folder/list | 09:28 |
epicmuffin | depending on what you want | 09:29 |
Romster | that's fine for sourceforge but doesn't cover other sites | 09:29 |
Romster | i could probably use an upper level tool to check for a json file and optionally use that. | 09:30 |
*** ninkotech has quit IRC | 09:31 | |
*** ninkotech has joined #circuits | 09:36 | |
kdb | Welcome back ninkotech :) | 09:36 |
b69904af95c0 | Welcome back ninkotech :) | 09:36 |
ninkotech | ;) | 09:36 |
ninkotech | (crashed x.org, grr) | 09:37 |
*** Osso has quit IRC | 09:43 | |
Romster | always fun | 11:12 |
Romster | epicmuffin, thanks for that json url that's something i can add to my lsurl program. | 11:12 |
epicmuffin | :) | 11:37 |
epicmuffin | opened sf.net and firebug and looked at the console if there is any ajax going on | 11:37 |
Romster | i never even thought of looking at ajax | 11:59 |
*** FSX has joined #circuits | 12:14 | |
kdb | Welcome back fsx :) | 12:14 |
b69904af95c0 | Hello fsx | 12:14 |
prologic | hiya all | 13:05 |
prologic | I'm available for about ~10mins | 13:05 |
prologic | if any circuits q's I'm here for them before going to bed :) | 13:05 |
prologic | ok | 13:21 |
prologic | night all | 13:21 |
*** robert___ has joined #circuits | 19:40 | |
*** robert___ has quit IRC | 19:40 | |
*** robert___ has joined #circuits | 19:40 | |
kdb | Hello robert___ | 19:40 |
b69904af95c0 | Yo robert___ | 19:40 |
*** ninkotech has quit IRC | 23:15 | |
*** ninkotech has joined #circuits | 23:15 | |
kdb | Welcome back ninkotech :) | 23:15 |
b69904af95c0 | Welcome back ninkotech :) | 23:15 |
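
The 02:48 to 02:53 exchange separates two ideas: -p/--pattern only filters which URLs get printed, while --blacklist/--whitelist decide which URLs the crawler follows. A minimal Python sketch of that split is below. The function names `should_follow` and `should_display` are hypothetical, and the rule that a whitelist match overrides the blacklist is an assumption inferred from the chat, not taken from spyda's source.

```python
import re


def should_follow(url, blacklist=(), whitelist=()):
    """Decide whether a crawler should follow url.

    Assumed semantics (not confirmed against spyda): a whitelist match
    always wins; otherwise any blacklist match rejects the URL.
    """
    if any(re.match(p, url) for p in whitelist):
        return True
    return not any(re.match(p, url) for p in blacklist)


def should_display(url, patterns=()):
    """Grep-like output filter: with no patterns, every URL is shown."""
    return not patterns or any(re.search(p, url) for p in patterns)


# Patterns modelled on the command line quoted in the log.
blacklist = [r".*"]
whitelist = [r"^http://sourceforge\.net/projects/xine/files/.*$"]
patterns = [r"^.*sourceforge\.net/projects/xine/files/.*$"]

urls = [
    "http://sourceforge.net/projects/xine/files/xine-lib/",
    "http://twitter.com/sourceforge",
]

# Following and displaying are decided independently of each other.
followed = [u for u in urls if should_follow(u, blacklist, whitelist)]
shown = [u for u in urls if should_display(u, patterns)]
```

Under these assumptions, blacklisting `.*` and whitelisting one prefix keeps the crawl on that prefix, which matches what prologic says the options should do.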
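
On the Content-Type point from 02:52 to 02:54: a header such as `text/html; charset=utf-8` carries parameters after a `;`, so comparing the whole value against `text/html` fails. One way to ignore the charset, as prologic suggests, is to compare only the part before the first `;`. This is a rough sketch, not spyda's code; `media_type`, `is_allowed`, and the default list of allowed types are hypothetical.

```python
def media_type(header_value):
    """Return the bare media type, dropping parameters such as charset."""
    return header_value.split(";", 1)[0].strip().lower()


def is_allowed(header_value, allowed=("text/html", "text/xml", "application/xml")):
    """Check a Content-Type header value against a set of media types.

    The default `allowed` tuple is an illustrative assumption.
    """
    return media_type(header_value) in allowed


html_ok = is_allowed("text/html; charset=utf-8")
archive_ok = is_allowed("application/x-gzip")
```

With this check, HTML and XML index pages pass regardless of charset, while archives are skipped.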