文件名称:
利用micemeat实现mapreduce
开发工具:
文件大小: 7mb
下载次数: 0
上传时间: 2013-04-18
详细说明: 最近一直在学coursera上面web intelligence and big data这门课,上周五印度老师布置了一个家庭作业,要求写一个mapreduce程序,用python来实现。 具体描述如下: Programming Assignment for HW3 Homework 3 (Programming Assignment A) Download data files bundled as a .zip file from hw3data.zip Each file in this archive contains entries that look like: journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2. that represent bibliographic information about publications, formatted as follows: paper-id:::author1::author2: :…. ::authorN:::title Your task is to compute how many times every term occurs across titles, for each author. For example, the author Alberto Pettorossi the following terms occur in titles with the indicated cumulative frequencies (across all his papers): program:3, transformation:2, transforming:2, using:2, programs:2, and logic:2. Remember that an author might have written multiple papers, which might be listed in multiple files. Further notice that ‘terms’ must exclude common stop-words, such as prepositions etc. For the purpose of this assignment, the stop-words that need to be omitted are listed in the script stopwords.py. In addition, single letter words, such as "a" can be ignored; also hyphens can be ignored (i.e. deleted). Lastly, periods, commas, etc. need to be ignored; in other words, only alphabets and numbers can be part of a title term: Thus, “program” and “program.” should both be counted as the term ‘program’, and "map-reduce" should be taken as 'map reduce'. Note: You do not need to do stemming, i.e. "algorithm" and "algorithms" can be treated as separate terms. The assignment is to write a parallel map-reduce program for the above task using either octo.py, or mincemeat.py, each of which is a lightweight map-reduce implementation written in Python. These are available from http://code.google.com/p/octopy/ and mincemeat.py-zipfile respectively. I strongly recommend mincemeat.py which is much faster than Octo,py even though the latter was covered first in the lecture video as an example. Both are very similar. Once you have computed the output, i.e. the terms-frequencies per author, go attempt Homework 3 where you will be asked questions that can be simply answered using your computed output, such as the top terms that occur for some particular author. Note: There is no need to submit the code; I assume you will experiment using octo.py to learn how to program using map-reduce. Of course, you can always write a serial program for the task at hand, but then you won’t learn anything about map-reduce. Lastly, please note that octo.py is a rather inefficient implementation of map-reduce. Some of you might want to delve into the code to figure out exactly why. At the same time, this inefficiency is likely to amplify any errors you make in formulating the map and reduce functions for the task at hand. So if your code starts taking too long, say more than an hour to run, there is probably something wrong. ...展开收缩
(系统自动生成,下载前可以参看下载内容)
下载文件列表
相关说明
- 本站资源为会员上传分享交流与学习,如有侵犯您的权益,请联系我们删除.
- 本站是交换下载平台,提供交流渠道,下载内容来自于网络,除下载问题外,其它问题请自行百度。
- 本站已设置防盗链,请勿用迅雷、QQ旋风等多线程下载软件下载资源,下载后用WinRAR最新版进行解压.
- 如果您发现内容无法下载,请稍后再次尝试;或者到消费记录里找到下载记录反馈给我们.
- 下载后发现下载的内容跟说明不相乎,请到消费记录里找到下载记录反馈给我们,经确认后退回积分.
- 如下载前有疑问,可以通过点击"提供者"的名字,查看对方的联系方式,联系对方咨询.