From 6b526a51865ba9f059302bdade7e4e12cd6ddf8d Mon Sep 17 00:00:00 2001
From: Jan Vales
Date: Wed, 15 May 2019 07:08:36 +0200
Subject: [PATCH] Reformatted. Corrected. Continued writing.

---
 ex2/main_1.tex | 29 ++++++++++++++++++++++++-----
 ex2/main_2.tex | 23 ++++++++++++++++++-----
 ex2/main_3.tex | 34 ++++++++++++++++++++++++----------
 3 files changed, 66 insertions(+), 20 deletions(-)

diff --git a/ex2/main_1.tex b/ex2/main_1.tex
index 3472f1c..56aabf2 100644
--- a/ex2/main_1.tex
+++ b/ex2/main_1.tex
@@ -1,26 +1,45 @@
 %ex2.1
-Map 53 53
-Reduce 1 1
-
 \begin{enumerate}[label=(\alph*)]
-  \item\begin{verbatim}rsync -vaPp --delete ~/gitstuff/adbs/ex2/mapreduce/ \
+  \item\textbf{Write a MapReduce job which takes as input the checkout records and
+  computes the following:\\
+  For each author, list the title that was checked out (borrowed) the most}\\
+
+  Compile and run:
+  \begin{verbatim}rsync -vaPp --delete ~/gitstuff/adbs/ex2/mapreduce/ \
 e726236f@lbd.zserv.tuwien.ac.at:mapreduce/; \
 ssh -t e726236f@lbd.zserv.tuwien.ac.at "cd mapreduce; ./build_run.sh '_ex1a' \
 '/user/adbs/2019S/shared/seattle-checkouts-by-title/checkouts-by-title.csv'"
\end{verbatim}
+
+  Results:\\
 http://localhost:19888/jobhistory/job/job\_1557406089646\_5204\\
 Tasks: Map: 53\\
 Reduce: 1\\
+  HDFS: Number of bytes read: 7118531360\\
+  HDFS: Number of bytes written: 28695222\\
+  Map input records: 32723546\\
+  Reduce output records: 324418\\

-  \item\begin{verbatim}rsync -vaPp --delete ~/gitstuff/adbs/ex2/mapreduce/ \
+  \item\textbf{Write a MapReduce job which takes as input both the library inventory
+  and the checkout records and computes the following:\\
+  For each author, list the title that was checked out (borrowed) the most}\\
+
+  Compile and run:
+  \begin{verbatim}rsync -vaPp --delete ~/gitstuff/adbs/ex2/mapreduce/ \
 e726236f@lbd.zserv.tuwien.ac.at:mapreduce/; \
 ssh -t e726236f@lbd.zserv.tuwien.ac.at "cd mapreduce; ./build_run.sh '_ex2a' \
'/user/adbs/2019S/shared/seattle-checkouts-by-title/checkouts-by-title.csv' \
'/user/adbs/2019S/shared/seattle-library-collection-inventory/library-collection-inventory.csv'"
\end{verbatim}
+
+  Results:\\
 http://localhost:19888/jobhistory/job/job\_1557406089646\_5309\\
 Map: 109\\
 Reduce: 1\\
+  HDFS: Number of bytes read: 14621528936\\
+  HDFS: Number of bytes written: 52724095\\
+  Map input records: 55563180\\
+  Reduce output records: 183399\\
 \end{enumerate}

diff --git a/ex2/main_2.tex b/ex2/main_2.tex
index af19c84..b71f5df 100644
--- a/ex2/main_2.tex
+++ b/ex2/main_2.tex
@@ -1,28 +1,41 @@
 %ex2.2
 \begin{enumerate}[label=(\alph*)]
-  \item
+  \item\textbf{Suppose we do not use a combiner at the Map tasks. Do you expect there to be skew in
+  the times taken by the various reducers to process their value lists? Why or why not?}\\
+
 Yes. Some words (e.g. ``I'', ``a'', ``are'', ...) occur very often, so the
 reducers handling those keys will take a long time to process all their
 values. Reducers handling less frequent words will be finished far sooner.

-  \item
+
+  \item\textbf{If we combine the reducer functions into a small number of Reduce tasks, say 10 tasks,
+  at random, do you expect the skew to be significant? What if we instead combined the
+  reducers into 10,000 Reduce tasks?}\\
+
 With fewer Reduce tasks the total time will go up, but the skew will be
 smaller, as a long value list is likely to land in a task that also holds
 short lists, which averages out the load across tasks.
 With 10,000 Reduce tasks the skew will go up, as a few of those tasks will
 have to handle very long lists while the others finish early and idle.
 This still assumes that we do not use a combiner.

-  \item
+
+  \item\textbf{Suppose we do use a combiner at the 100 Map tasks. Do you expect the skew to be
+  significant? Why or why not?}\\
+
 There will be much less skew than without a combiner.
 Many words appear many times on a page, so pre-aggregating them with a
 combiner significantly reduces the number of values the reducers receive,
 and it also reduces the communication cost.

-  \item
+
+  \item\textbf{Suppose we increase the number of Reduce tasks significantly, say to 1,000,000. Do you
+  expect the communication cost to increase? What about the replication rate or reducer
+  size? Justify your answers.}\\
+
 Communication cost will increase, as the reducer size can be set to lower
 values. This will also reduce the skew in the times taken by the reducers.
 Replication rate will stay the same, as this is determined by mapped
-values divided by inputs (which is about 2500 for an average webpage)
+values divided by inputs (which is about 2500 for an average webpage).
 Reducer size can go down, in order to reduce reducer runtime skew.
 \end{enumerate}

diff --git a/ex2/main_3.tex b/ex2/main_3.tex
index b02c427..2724018 100644
--- a/ex2/main_3.tex
+++ b/ex2/main_3.tex
@@ -1,16 +1,30 @@
 %ex2.3
 \begin{enumerate}[label=(\alph*)]
-  \item
+  \item\textbf{Bag Union, defined to be the bag of tuples in which tuple t appears the sum of the
+  numbers of times it appears in R and S, e.g. if tuple t occurs in R 3 times, and in S 4
+  times, then it should occur in the bag union of R and S 7 times.}\\
+
 \textbf{Map function:} for every tuple t in R and for every tuple t in
-  S, produce the key-value pair (t,t);\\
-  \textbf{Reduce funtion:} Receives input of the form (t; [t, t, …, t])\\
-  \item
+  S, produce the key-value pair (t,t);
+
+  \textbf{Reduce function:} Receives input of the form (t; [t, t, …, t]);\\
+  outputs (t,t) $k$ times, where $k$ is the length of the value list.\\
+
+  \item\textbf{Bag Difference, defined to be the bag of tuples in which the number of times a tuple
+  t appears is equal to the number of times it appears in R minus the number of times it
+  appears in S. A tuple that appears more times in S than in R does not appear in the
+  difference, e.g.
if tuple t occurs in R 4 times, and in S 3 times, then it should occur in
+  the bag difference of R with S only once.}\\
+
 \textbf{Map function:} for every tuple t in R and for every tuple t in
-  S, produce the key-value pair (t; ''R'') or (t; ''S''), respectively.\\
-  \textbf{Reduce function:} Receives input of the form (t;
-  [x_{1},…,x_{n}]), where each x_{i} is either the form of ''R'' or
-  ''S''.
-  Outputs (t,t) if the input contains ''R'' but not ''S''.
-  Outputs nothing otherwise.\\
+  S, produce the key-value pair (t; ``R'') or (t; ``S''), respectively.
+
+  \textbf{Reduce function:} Receives input of the form (t; [$x_{1}$,…,$x_{n}$]), where each $x_{i}$ is either ``R'' or ``S''.\\
+  Outputs (t,t) if the input contains ``R'' but not ``S''.\\
+  Outputs nothing otherwise. Note that this emits t at most once and drops t
+  whenever it appears in S at all, so it is only correct when each tuple
+  occurs at most once; the alternative below handles general multiplicities.
+
+  \textbf{Alternative Reduce function:} Receives input of the form (t; [$x_{1}$,…,$x_{n}$]), where each $x_{i}$ is either ``R'' or ``S''.\\
+  Initialize a counter $k = 0$. For each ``R'' increment $k$; for each ``S'' decrement $k$.\\
+  Outputs (t,t) $k$ times, if $k > 0$.
 \end{enumerate}
--
2.43.0
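As a sanity check, the counter-based Reduce function for bag difference can be sketched as a small Python simulation. The in-memory grouping stands in for the MapReduce shuffle, and all names here are illustrative only, not part of the actual Hadoop job from exercise 1:

```python
from collections import defaultdict

def map_bag_difference(relation_name, tuples):
    # Emit (t, "R") or (t, "S") for every tuple occurrence.
    for t in tuples:
        yield (t, relation_name)

def reduce_bag_difference(key, values):
    # Counter variant: +1 per "R", -1 per "S"; emit the key k times if k > 0.
    k = sum(1 if v == "R" else -1 for v in values)
    return [key] * max(k, 0)

def run_job(R, S):
    # Toy shuffle: group all mapper output by key, then reduce each group.
    groups = defaultdict(list)
    for pairs in (map_bag_difference("R", R), map_bag_difference("S", S)):
        for t, tag in pairs:
            groups[t].append(tag)
    out = []
    for t, tags in groups.items():
        out.extend(reduce_bag_difference(t, tags))
    return out

R = ["a", "a", "a", "a", "b"]
S = ["a", "a", "a", "c"]
print(sorted(run_job(R, S)))  # → ['a', 'b']
```

This reproduces the example from the question: "a" occurs 4 times in R and 3 times in S, so it survives exactly once; "c" occurs only in S and is dropped.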