peigang

Nutch Study Notes 2: Injector, the Entry Point for Data Download

 

org.apache.nutch.crawl.Injector

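public class Injector extends Configured implements Tool

As the superclass and the implemented interface show, Injector wraps Hadoop: its constructor initializes the Hadoop configuration object, Configuration (for how Configuration works internally, see the post "Hadoop Study Notes 1: Configuration"). This is one of the ways Nutch wraps Hadoop.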
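To make the shape of that declaration concrete, here is a minimal sketch that uses only the JDK. Conf, ConfiguredBase, CliTool and InjectorLike are hypothetical stand-ins written for this note, not the Hadoop classes; they only illustrate the idea that the base class carries the configuration and the interface supplies the command-line entry point.

```java
import java.util.HashMap;
import java.util.Map;

public class ToolPatternSketch {
    /** Hypothetical stand-in for a configuration object: string keys with defaults. */
    static final class Conf {
        private final Map<String, String> values = new HashMap<>();

        void set(String key, String value) { values.put(key, value); }

        String get(String key, String defaultValue) {
            return values.getOrDefault(key, defaultValue);
        }
    }

    /** Hypothetical stand-in for a base class that only carries the configuration. */
    static class ConfiguredBase {
        private final Conf conf;

        ConfiguredBase(Conf conf) { this.conf = conf; }

        Conf getConf() { return conf; }
    }

    /** Hypothetical stand-in for a command-line tool contract. */
    interface CliTool {
        int run(String[] args);
    }

    /** Configuration comes from the base class, the entry point from the interface. */
    static final class InjectorLike extends ConfiguredBase implements CliTool {
        InjectorLike(Conf conf) { super(conf); }

        /** Reads a setting through the inherited getConf(), falling back to a default. */
        String tempRoot() { return getConf().get("mapred.temp.dir", "."); }

        @Override
        public int run(String[] args) {
            if (args.length < 2) {
                System.err.println("Usage: InjectorLike <crawldb> <url_dir>");
                return -1;
            }
            System.out.println("would inject " + args[1] + " into " + args[0]
                    + " using temp root " + tempRoot());
            return 0;
        }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.set("mapred.temp.dir", "/tmp/nutch");
        new InjectorLike(conf).run(new String[] {"crawl/crawldb", "urls"});
    }
}
```

The point of the split is that any code holding the object can read settings through getConf() without knowing where they were loaded from.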

Injector has two fields:

/** metadata key reserved for setting a custom score for a specific URL */
public static String nutchScoreMDName = "nutch.score";
/** metadata key reserved for setting a custom fetchInterval for a specific URL */
public static String nutchFetchIntervalMDName = "nutch.fetchInterval";
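The two keys are easier to picture with a seed line in hand. The sketch below assumes a seed line has the form url, then tab-separated name=value pairs, and that the two reserved keys override a default score and fetch interval; that layout, the default values, and the SeedLineSketch and Seed names are assumptions made for illustration, not code taken from Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SeedLineSketch {
    // Same key strings as the two Injector fields quoted above.
    static final String SCORE_KEY = "nutch.score";
    static final String FETCH_INTERVAL_KEY = "nutch.fetchInterval";

    /** Hypothetical holder for one parsed seed line; not a Nutch class. */
    static final class Seed {
        final String url;
        float score = 1.0f;            // assumed default score
        int fetchInterval = 2592000;   // assumed default interval: 30 days in seconds
        final Map<String, String> metadata = new LinkedHashMap<>();

        Seed(String url) { this.url = url; }
    }

    /** Parses "url<TAB>name=value<TAB>name=value..." (assumed format). */
    static Seed parse(String line) {
        String[] parts = line.trim().split("\t");
        Seed seed = new Seed(parts[0].trim());
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                continue; // skip fields that are not name=value
            }
            String name = parts[i].substring(0, eq).trim();
            String value = parts[i].substring(eq + 1).trim();
            try {
                if (name.equals(SCORE_KEY)) {
                    seed.score = Float.parseFloat(value);
                } else if (name.equals(FETCH_INTERVAL_KEY)) {
                    seed.fetchInterval = Integer.parseInt(value);
                } else {
                    seed.metadata.put(name, value); // ordinary metadata is kept as-is
                }
            } catch (NumberFormatException e) {
                // keep the default when a reserved key has an unparsable value
            }
        }
        return seed;
    }

    public static void main(String[] args) {
        Seed s = parse("http://example.com/\tnutch.score=2.5\tnutch.fetchInterval=3600\tcategory=news");
        System.out.println(s.url + " score=" + s.score
                + " interval=" + s.fetchInterval + " meta=" + s.metadata);
    }
}
```

Reserving the two names means a per-URL score or interval can ride along in the same place as any other metadata, without a separate file format.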

 

 

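And one method:

The first parameter, crawlDb, is the path to the crawlDb directory under the Nutch crawl directory;

The second parameter, urlDir, is the directory that holds the list files of URLs for Nutch to crawl.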

 

 

public void inject(Path crawlDb, Path urlDir)

 

public void inject(Path crawlDb, Path urlDir) throws IOException {
  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  long start = System.currentTimeMillis();

  // log the start of the run
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: starting at " + sdf.format(start));
    LOG.info("Injector: crawlDb: " + crawlDb);
    LOG.info("Injector: urlDir: " + urlDir);
  }

  // build a randomly named temporary directory path
  Path tempDir = new Path(getConf().get("mapred.temp.dir", ".") + "/inject-temp-" +
      Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  // map text input file to a <url,CrawlDatum> file
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: Converting injected urls to crawl db entries.");
  }

  // create the job and set InjectMapper.class as its mapper
  JobConf sortJob = new NutchJob(getConf());
  sortJob.setJobName("inject " + urlDir);
  FileInputFormat.addInputPath(sortJob, urlDir);
  sortJob.setMapperClass(InjectMapper.class);
  // set the map output path
  FileOutputFormat.setOutputPath(sortJob, tempDir);

  // set the job parameters
  sortJob.setOutputFormat(SequenceFileOutputFormat.class);
  sortJob.setOutputKeyClass(Text.class);
  sortJob.setOutputValueClass(CrawlDatum.class);
  sortJob.setLong("injector.current.time", System.currentTimeMillis());

  // run the job
  JobClient.runJob(sortJob);

  // merge with existing crawl db
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: Merging injected urls into crawl db.");
  }

  JobConf mergeJob = CrawlDb.createJob(getConf(), crawlDb);
  FileInputFormat.addInputPath(mergeJob, tempDir);
  mergeJob.setReducerClass(InjectReducer.class);
  JobClient.runJob(mergeJob);
  CrawlDb.install(mergeJob, crawlDb);

  // clean up
  FileSystem fs = FileSystem.get(getConf());
  fs.delete(tempDir, true);

  long end = System.currentTimeMillis();
  LOG.info("Injector: finished at " + sdf.format(end) + ", elapsed: " + TimingUtil.elapsedTime(start, end));
}
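The method runs two jobs in sequence: the first converts the seed text into <url, CrawlDatum> pairs in a temporary directory, and the second merges those pairs into the existing crawl db. The in-memory sketch below mimics that two-stage flow with plain maps. Datum, toEntries and merge are hypothetical names, and the merge rule shown (an entry already in the crawl db is kept, a new URL is added as unfetched) is an assumption about what InjectReducer does, since the post does not show its code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InjectMergeSketch {
    /** Hypothetical stand-in for a crawl db value; not Nutch's CrawlDatum. */
    static final class Datum {
        final String status;
        final float score;

        Datum(String status, float score) {
            this.status = status;
            this.score = score;
        }

        @Override
        public String toString() { return status + "/" + score; }
    }

    /** Stage 1 stand-in: one (url, datum) pair per non-empty seed line, marked "injected". */
    static Map<String, Datum> toEntries(List<String> seedLines, float injectedScore) {
        Map<String, Datum> out = new LinkedHashMap<>();
        for (String line : seedLines) {
            String url = line.trim();
            if (url.isEmpty()) {
                continue; // ignore blank lines
            }
            out.put(url, new Datum("injected", injectedScore));
        }
        return out;
    }

    /** Stage 2 stand-in: existing entries win; new URLs are added as "unfetched" (assumed rule). */
    static Map<String, Datum> merge(Map<String, Datum> existing, Map<String, Datum> injected) {
        Map<String, Datum> result = new TreeMap<>(existing);
        for (Map.Entry<String, Datum> e : injected.entrySet()) {
            result.putIfAbsent(e.getKey(), new Datum("unfetched", e.getValue().score));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Datum> crawlDb = new TreeMap<>();
        crawlDb.put("http://a.example/", new Datum("fetched", 3.0f));

        Map<String, Datum> injected =
                toEntries(Arrays.asList("http://a.example/", "", "http://b.example/"), 1.0f);

        System.out.println(merge(crawlDb, injected));
    }
}
```

Writing the first stage to a temporary location and merging afterwards keeps the existing crawl db untouched until the merged result is ready to be installed.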

 

 

 
