Guia Contador de Palabras Cloudera

UNIVERSIDAD MARIANA
FACULTAD DE INGENIERÍA
PROGRAMA DE INGENIERÍA DE SISTEMAS
ELECTIVA BIG DATA
GUIA CLOUDERA
Downloading and Installing the Cloudera VM Instructions (Mac)
Learning Goals
In this activity, you will:
 Download and Install VirtualBox.

 Download and Install Cloudera Virtual Machine (VM) Image.
 Launch the Cloudera VM.
Hardware Requirements: (A) Quad Core Processor (VT-x or AMD-V support recommended), 64-
bit; (B) 8 GB RAM; (C) 20 GB disk free. How to find your hardware information: Open Overview
by clicking on the Apple menu and clicking “About This Mac.” Most computers with 8 GB RAM
purchased in the last 3 years will meet the minimum requirements. You will need a high speed
internet connection because you will be downloading files up to 4 Gb in size.
Instructions
Please use the following instructions to download and install the Cloudera Quickstart VM with
VirutalBox before proceeding to the Getting Started with the Cloudera VM Environment video.
The screenshots are from a Mac but the instructions should be the same for Windows. Please
see the discussion boards if you have any issues.
1. Install VirtualBox. Go to https://www.virtualbox.org/wiki/Downloads to download and install

VirtualBox for your computer. The course uses Virtualbox 5.1.X, so we recommend
clicking VirtualBox 5.1 builds on that page and downloading the older package for ease of
following instructions and screenshots. However, it shouldn't be too different if you choose to
use or upgrade to VirtualBox 5.2.X.
2. Download the Cloudera VM. Download the Cloudera VM

from https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.4.2-0-
virtualbox.zip. The VM is over 4GB, so will take some time to download
3. Unzip the Cloudera VM:
On Mac: Double click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip
On Windows: Right-click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip and select “Extract

All…”
4. Start VirtualBox.
5. Begin importing. Import the VM by going to File -> Import Appliance
6. Click the Folder icon.

7. Select the cloudera-quickstart-vm-5.4.2-0-virtualbox.ovf from the Folder where you
unzipped the VirtualBox VM and click Open.
8. Click Continue to proceed.

9. Click Import.
10. The virtual machine image will be imported. This can take several minutes.
11. Launch Cloudera VM. When the importing is finished, the quickstart-vm-5.4.2-0 VM will
appear on the left in the VirtualBox window. Select it and click the Start button to launch the
VM.
12. Cloudera VM booting. It will take several minutes for the Virtual Machine to start. The
booting process takes a long time since many Hadoop tools are started.
13. The Cloudera VM desktop. Once the booting process is complete, the desktop will appear
with a browser.
Downloading and Installing the Cloudera VM Instructions (Windows)
Learning Goals
In this activity, you will:
 Download and Install VirtualBox.

 Download and Install Cloudera Virtual Machine (VM) Image.
 Launch the Cloudera VM.
Hardware Requirements: (A) Quad Core Processor (VT-x or AMD-V support recommended),
64-bit; (B) 8 GB RAM; (C) 20 GB disk free. How to find your hardware information: Open
System by clicking the Start button, right-clicking Computer, and then clicking Properties. Most
computers with 8 GB RAM purchased in the last 3 years will meet the minimum
requirements.You will need a high speed internet connection because you will be downloading
files up to 4 Gb in size.
Instructions
Please use the following instructions to download and install the Cloudera Quickstart VM with
VirutalBox before proceeding to the Getting Started with the Cloudera VM Environment video.
The screenshots are from a Mac but the instructions should be the same for Windows. Please
see the discussion boards if you have any issues.
1. Install VirtualBox. Go to https://www.virtualbox.org/wiki/Downloads to download and install

VirtualBox for your computer. The course uses Virtualbox 5.1.X, so we recommend
clicking VirtualBox 5.1 builds on that page and downloading the older package for ease of
following instructions and screenshots. However, it shouldn't be too different if you choose to
use or upgrade to VirtualBox 5.2.X. For Windows, select the link "VirtualBox 5.1.X for
Windows hosts x86/amd64" where 'X' is the latest version.
2. Download the Cloudera VM. Download the Cloudera VM

fromhttps://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.4.2-0-
virtualbox.zip. The VM is over 4GB, so will take some time to download.
3. Unzip the Cloudera VM:
Right-click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip and select “Extract All…”

4. Start VirtualBox.
5. Begin importing. Import the VM by going to File -> Import Appliance
6. Click the Folder icon.

7. Select the cloudera-quickstart-vm-5.4.2-0-virtualbox.ovf from the Folder where you
unzipped the VirtualBox VM and click Open.
8. Click Next to proceed.

9. Click Import.
10. The virtual machine image will be imported. This can take several minutes.
11. Launch Cloudera VM. When the importing is finished, the quickstart-vm-5.4.2-0 VM will
appear on the left in the VirtualBox window. Select it and click the Start button to launch the
VM.
12. Cloudera VM booting. It will take several minutes for the Virtual Machine to start. The
booting process takes a long time since many Hadoop tools are started.
13. The Cloudera VM desktop. Once the booting process is complete, the desktop will appear
with a browser.
Completado
FAQ
 I am getting a failed to import error when I try to import Cloudera into Virtual Box?
The PC BIOS needs to be set to allow virtual technology.
 My screen keeps freezing even though virtual machine is in 'running state'?
Often times the virtual machine takes time to finish booting. If it takes more than five minutes
try restarting the virtual machine.
 A message keeps saying that name mode is in safe mode so the file cannot be
created?
After opening the Virtual Box, right click on the VM and look for "Normal Start."
Copy your data into the Hadoop Distributed File System (HDFS)
Instructions
Learning Goals
By the end of this activity, you will be able to:
 Interact with Hadoop using the command-line application.

 Copy files into and out of the Hadoop Distributed File System (HDFS).
Instructions
1. Open a browser. Open the browser by click on the browser icon on the top left of the
screen.
2. Download the Shakespeare. We are going to download a text file to copy into HDFS. Enter
the following link in the
browser: http://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
Once the page is loaded, click on the Open menu button.

Click on Save Page
Change the output to words.txt and click Save.
2. Open a terminal shell. Open a terminal shell by clicking on the square black box on the top
left of the screen.
Run cd Downloads to change to the Downloads directory.
Run ls to see that words.txt was saved.
3. Copy file to HDFS. Run hadoop fs –copyFromLocal words.txt to copy the text file to HDFS.
4. Verify file was copied to HDFS. Run hadoop fs –ls to verify the file was copied to HDFS.
5. Copy a file within HDFS. You can make a copy of a file in HDFS. Run hadoop fs -cp
words.txt words2.txt to make a copy of words.txt called words2.txt
We can see the new file by running hadoop fs -ls
6. Copy a file from HDFS. We can also copy a file from HDFS to the local file system.
Run hadoop fs -copyToLocal words2.txt . to copy words2.txt to the local directory.
Let's run ls to see that the file was copied to see that words2.txt is there.
7. Delete a file in HDFS. Let's the delete words2.txt in HDFS. Run hadoop fs -rm words2.txt
Run hadoop fs -ls to see that the file is gone.
Completado
En esta guía aprenderemos sobre los comandos básicos de manipulación de

archivos para el Sistema de archivos Hadoop o HDFS. Primero comenzaremos
descargando un archivo de texto de palabras. Usaremos este archivo de texto para
copiar hacia y desde el sistema de archivos local en HDFS y luego lo usaremos para
ejecutar el conteo de palabras.
Después de descargar el archivo de texto, abriremos un Shell de terminal y
copiaremos el archivo de texto del sistema de archivos local a HDFS.
A continuación, copiaremos el archivo dentro de HDFS y también veremos cómo
copiar el archivo de HDFS al sistema de archivos local. Finalmente, veremos cómo
eliminar un archivo en HDFS.
Empecemos.
Vamos a descargar el archivo de texto para copiarlo en HDFS. No importa cuál sea
el contenido del archivo de texto, por lo que descargaremos los trabajos completos
de Shakespeare ya que contiene texto interesante.
 Primero, inicia un navegador web.
 Ahora buscaremos en Google las obras completas de Shakespeare.
 Este trabajo completo de Shakespeare lo guardaremos en un archivo de texto
en el sistema de archivos local y lo llamaremos words.txt. Y la carpeta
Guardar en, la guardará en el directorio de Descargas.
 Una vez que se complete, abriremos una ventana de terminal
 Entonces, si entramos en el Directorio de descargas escribiendo un cd
Downloads y ejecutando el comando ls, podemos ver que words.txt se
descargó con éxito.
 Continuando, copiemos words.txt del sistema de archivos local al sistema de
archivos HDFS. El comando para hacer esto es hadoop fs- copyFromLocal
words.txt.
 Cuando ejecuto esto, lo copiará del directorio local y del sistema de archivos
local a HDFS.
 Podemos ver que el archivo se copió ejecutando hadoop fs -ls. Puede ver
que el archivo se copió correctamente.
 A continuación, podemos copiar este archivo a otro archivo dentro de HDFS.
Podemos hacer esto ejecutando hadoop fs -cp words.txt words2.txt. Las
primeras words.txt es el archivo que ya existe en HDFS.
 La segunda words2.txt es el nuevo archivo que vamos a crear cuando
ejecutemos este comando.
 Ejecútelo, y nuevamente podemos ejecutar hadoop fs -ls para ver los
archivos en HDFS. Podemos ver el archivo original words.txt, y la copia que
se hizo, words2.txt.
 Copiemos words2.txt de HDFS al sistema de archivos local. Podemos hacer
esto ejecutando hadoop fs -copyToLocal words2.txt
 Después de ejecutar este comando, puedo llamar a ls para ver el contenido
del sistema de archivos local.
 Así que ahora tenemos el nuevo archivo words2.txt que acabamos de copiar
de HDFS.
 El último paso en esta guía es eliminar un archivo en HDFS. Podemos
eliminar words2.txt ejecutando hadoop fs, pero es rn words2.txt.
 Como puede ver, imprimió que eliminó el archivo. También podemos ejecutar
hadoop fs- ls, para verificar que el archivo se haya eliminado.
Puede ver que solo está el archivo words.txt, y words2.txt se eliminó.
Run the WordCount program Instructions
Learning Goals
By the end of this activity, you will be able to:
 Execute the WordCount application.

 Copy the results from WordCount out of HDFS.
1. Open a terminal shell. Start the Cloudera VM in VirtualBox, if not already running, and
open a terminal shell. Detailed instructions for these steps can be found in the previous
Readings.
2. See example MapReduce programs. Hadoop comes with several example MapReduce
applications. You can see a list of them by running hadoop jar /usr/jars/hadoop-
examples.jar. We are interested in running WordCount.
The output says that WordCount takes the name of one or more input files and the name of the
output directory. Note that these files are in HDFS, not the local file system.
3. Verify input file exists. In the previous Reading, we downloaded the complete works of
Shakespeare and copied them into HDFS. Let's make sure this file is still in HDFS so we can
run WordCount on it. Run hadoop fs -ls
4. See WordCount command line arguments. We can learn how to run WordCount by
examining its command-line arguments. Run hadoop jar /usr/jars/hadoop-examples.jar
wordcount.
5. Run WordCount. Run WordCount for words.txt: hadoop jar /usr/jars/hadoop-examples.jar

wordcount words.txt out
As WordCount executes, the Hadoop prints the progress in terms of Map and Reduce. When
the WordCount is complete, both will say 100%.
6. See WordCount output directory. Once WordCount is finished, let's verify the output was
created. First, let's see that the output directory, out, was created in HDFS by running hadoop
fs –ls
We can see there are now two items in HDFS: words.txt is the text file that we previously
created, and out is the directory created by WordCount.
7. Look inside output directory. The directory created by WordCount contains several files.
Look inside the directory by running hadoop –fs ls out
The file part-r-00000 contains the results from WordCount. The file _SUCCESS means
WordCount executed successfully.
8. Copy WordCount results to local file system. Copy part-r-00000 to the local file system
by running hadoop fs –copyToLocal out/part-r-00000 local.txt
9. View the WordCount results. View the contents of the results: more local.txt
Each line of the results file shows the number of occurrences for a word in the input file. For
example, Accuse appears four times in the input, but Accusing appears only once.
Completado
En esta guía usaremos Hadoop para ejecutar WordCount. Primero abriremos un

terminal shell y exploraremos los programas MapReduce proporcionados por
Hadoop. A continuación, verificaremos que el archivo de entrada existe en HDFS.
Luego ejecutaremos WordCount y exploraremos el directorio de salida de
WordCount. Después de eso, copiaremos los resultados de WordCount de HDFS al
sistema de archivos local y los veremos. Vamos a empezar. Primero abriremos un
shell de terminal haciendo clic en el icono en la parte superior de la ventana aquí.
A continuación, veremos los programas producidos en el mapa que vienen con
Hadoop. Podemos hacer esto ejecutando Hadoop, jars user jars, ejemplos de
Hadoop .jar.
Este comando dice que vamos a usar el comando jar para ejecutar un programa en
Hadoop desde un archivo jar. Y el archivo jar desde el que estamos ejecutando está
en /usr/jars/hadoop-examples.jar. Muchos programas escritos en Java se
distribuyen a través de archivos jar. Si ejecutamos este comando, veremos una lista
de diferentes programas que vienen con Hadoop. Entonces, por ejemplo,
wordcount. Cuenta las palabras en un archivo de texto.
Wordmean, cuenta la longitud promedio de las palabras. Y otros programas, como
ordenar y calcular la longitud de pi.
En la guía anterior, descargamos las Obras de Shakespeare y las guardamos en
HDFS. Asegurémonos de que el archivo sigue ahí ejecutando hadoop fs -ls.
Podemos ver que el archivo todavía está allí, y se llama words.txt.
Podemos ejecutar wordcount ejecutando hadoop jar /usr/jars/hadoop-examples.jar
wordcount. Este comando dice que vamos a ejecutar un jar, y este es el nombre del
jar que contiene el programa. Y el programa que vamos a ejecutar es Wordcount.
Cuando lo ejecutamos, vemos que imprime el uso de la línea de comandos sobre
cómo ejecutar el conteo de palabras. Esto dice que wordcount toma uno o más
archivos de entrada y un nombre de salida. Ahora, tanto la entrada como la salida
se encuentran en HDFS. Entonces tenemos el archivo de entrada que acabamos
de enumerar, words.txt, en HDFS. Podemos ejecutar wordcount.
Entonces ejecutaremos hadoop jar / usr / jars / hadoop-examples.jar wordcount
words.txt out.
Esto significa que vamos a ejecutar el programa WordCount usando words.txt como
entrada y colocaremos la salida en un directorio llamado.
Entonces lo correremos.
A medida que se ejecuta el conteo de palabras, sus impresiones avanzan a la
pantalla.
Imprimirá el porcentaje de mapa y reducirá completado. Y cuando ambos alcanzan
el 100%, entonces el trabajo está hecho.
Ahora que el trabajo está completo, veamos los resultados.
Podemos ejecutar Hadoop fs -ls para ver el resultado.
Esto muestra que out fue creado y aquí es donde se almacenan nuestros resultados.
Observe que es un directorio con una d aquí. Entonces, el conteo de palabras
hadoop creó el directorio para contener la salida. Miremos dentro de ese directorio
ejecutando Hadoop fs- ls. [AUDIO EN BLANCO] Podemos ver que hay dos archivos
en este directorio. El primero es _SUCCESS, esto significa que el trabajo de
WordCount se completó correctamente. El otro archivo part-r-00000 es un archivo
de texto que contiene la salida del comando WordCount
Ahora copiemos este archivo de texto al sistema de archivos local desde HDFS y
luego véalo. Podríamos copiarlo ejecutando hadoop fs -copytolocal out / part-r-
00000 local.
Y diremos que local.txt es el nombre. Puedes ver los resultados de esto. Estamos
ejecutando más local.txt.
Esto verá el contenido del archivo.
Podemos presionar la barra espaciadora para desplazarnos hacia abajo.
Vemos los resultados de WordCount en este archivo. Cada línea es una palabra en
particular y la segunda columna es el recuento de cuántas palabras de esta palabra
en particular se encontraron en el archivo de entrada.
Puedes presionar q para salir
2How do I figure out how to run Hadoop MapReduce programs?
Hadoop comes with several MapReduce applications. In the Cloudera VM, these applications
are in /usr/jars/hadoop-examples.jar. You can see a list of all the applications by
running hadoop jar /usr/jars/hadoop-examples.jar.
Each of these MapReduce applications can be run in the terminal. To see how to run a specific
application, append the application name to the end of the command line. For example, to see
how to run wordcount, run hadoop jar /usr/jars/hadoop-examples.jar wordcount.
The output tells you how to run wordcount:
1 Usage: wordcount <in> [<in>...] <out>
The <in> and <out> denote the names of the input and output, respectively. The square
brackets around the second <in> mean that the second input is optional, and the ... means that
more than one input can be used.
This usage says that wordcount is run with one or more inputs and one output, the input(s) are
specified first, and the output is specified last.

Guia Contador de Palabras Cloudera

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Guia Contador de Palabras Cloudera

Uploaded by

Copyright:

Available Formats

UNIVERSIDAD MARIANA

Downloading and Installing the Cloudera VM Instructions (Mac)

 Download and Install VirtualBox.

1. Install VirtualBox. Go to https://www.virtualbox.org/wiki/Downloads to download and install

2. Download the Cloudera VM. Download the Cloudera VM

On Mac: Double click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip

On Windows: Right-click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip and select “Extract

5. Begin importing. Import the VM by going to File -> Import Appliance

6. Click the Folder icon.

8. Click Continue to proceed.

 Download and Install VirtualBox.

1. Install VirtualBox. Go to https://www.virtualbox.org/wiki/Downloads to download and install

2. Download the Cloudera VM. Download the Cloudera VM

3. Unzip the Cloudera VM:

Right-click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip and select “Extract All…”

5. Begin importing. Import the VM by going to File -> Import Appliance

6. Click the Folder icon.

8. Click Next to proceed.

The PC BIOS needs to be set to allow virtual technology.

 My screen keeps freezing even though virtual machine is in 'running state'?

 Interact with Hadoop using the command-line application.

Once the page is loaded, click on the Open menu button.

Change the output to words.txt and click Save.

Run cd Downloads to change to the Downloads directory.

Run ls to see that words.txt was saved.

We can see the new file by running hadoop fs -ls

En esta guía aprenderemos sobre los comandos básicos de manipulación de

Run the WordCount program Instructions

 Execute the WordCount application.

5. Run WordCount. Run WordCount for words.txt: hadoop jar /usr/jars/hadoop-examples.jar

En esta guía usaremos Hadoop para ejecutar WordCount. Primero abriremos un

2How do I figure out how to run Hadoop MapReduce programs?

1 Usage: wordcount <in> [<in>...] <out>

You might also like