
Business Intelligence NAME: _______________

Professor Chen Due Date: __________

Part I. Updating RapidMiner for Text and Web Mining

1. Launch RapidMiner by right-clicking the RapidMiner icon and clicking “Run as administrator” as
shown in Fig 1. Then click “Yes” when a new window pops up.

Fig 1

2. Once RapidMiner is open, it will ask you if you want to download updates. Click Yes. If it does not ask
you, click on “Help” on the toolbar at the top of the screen, and click “Update and Extensions
(Marketplace)” as shown in Fig 2.

Fig 2

3. After a short loading time, the “RapidMiner Marketplace” window will open as shown in Fig 3-a.
Select “Updates” to display the available updates (or select “Top Downloads” if no appropriate
update is listed). The screen shows the currently installed version (in orange) and the latest
version (in black) for each product. For example, the installed version of the Text Mining Extension
is 5.3.0 and the latest version is 5.3.2, so we should update it to the newest version. To do so,
select the ‘Text Mining Extension’ item on the left-hand side; its associated information is then
displayed on the right-hand side as shown in Fig 3-b. Next, select the other two choices (‘Select
for update’ and ‘Install 1 package’) to continue the update (Fig 3-b). Click the ‘accept terms …’
and ‘install’ buttons to confirm the update, and finally select ‘Yes’ to complete the update
process as shown in Fig 3-d.

Fig 3-a Fig 3-b

Fig 3-c Fig 3-d

4. Repeat the same process to complete the updates for Web Mining Extension and RapidMiner if needed.

Updating RapidMiner for Text & Web Mining and Loading Text into RapidMiner; Page-2
Part II. Loading text into RapidMiner

To perform a text “Document” process, begin by clicking the “New Process” icon in the main menu.
On the left-hand side of the screen, click Operators, then expand (by clicking the plus (+) sign)
Text Processing -> Create Document as shown in Fig 1-a. There are two ways to input text into
RapidMiner. This tutorial covers both of them.

1. Copy/Pasting text into an operator:


i. Open the text file to be analyzed (e.g., Anyone lived in a pretty how town.txt, Fig 1-a). In
RapidMiner, double-click the “Create Document” operator to add it to the Main Process window. Then
click the Edit Text parameter on the right of the screen to add text to the operator as shown in Fig 1-b.

Fig 1-a
Fig 1-b


Fig 1-c

ii. Copy the text to be analyzed (Fig 1-a), paste it into the ‘Text Parameter Text’ window that
opens (Fig 2-a), and then click ‘Apply’ as shown in Fig 2-b. For this example we will be using
the poem “anyone lived in a pretty how town” by E. E. Cummings.
iii. The Create Document screen is shown in Fig 2-c. If you click “RUN”, the result screen appears
as shown in Fig 2-d.

Fig 2-a Fig 2-b

Fig 2-c

Fig 2-d

2. Reading input text from a file (or multiple files) stored on your computer:

2a). Reading text from a file stored on your computer:


i. Click the “New Process” icon in the Main Menu, search for the “Read Document” operator
(under Text Processing -> Utility), and add it to the Main Process window. After clicking the
operator, click the yellow folder (Browse) on the right of the screen to search for the file you want
to import as shown in Fig 3-a.
ii. Locate the document you want to load into the operator and click Open, as shown in Fig 3-b and
Fig 3-c.
iii. Click “RUN”. If a window appears asking for a Repository location, enter a name (Fig 3-d) and
select OK. The result screen is shown in Fig 3-e.


Fig 3-a

Fig 3-b

Fig 3-c

Fig 3-d

Fig 3-e

2b). Reading text from multiple files stored on your computer:

i. Click the “New Process” icon in the Main Menu, search for the “Process Documents from
Files” operator (available under Text Processing -> Utility), and add (drag) it to the Main
Process window. By clicking the Edit List parameter shown in Fig 4-a, we can specify where
the files we want to load are located.

Fig 4-a

ii. Once you have clicked the Edit List button, a new window like the one in Fig 4-b will appear.
By clicking the Folder button, we can tell RapidMiner where the files are located. Enter any
name under “class name” as shown in Fig 4-c. Then click “Add Entry” to add another file as
shown in Fig 4-d. When all files are added, click “Apply” to complete the entries as shown in
Fig 4-e.

Fig 4-b Fig 4-c

Fig 4-d Fig 4-e
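
For readers who want to see what the Edit List entries amount to, the following is a minimal Python
sketch, not RapidMiner code. It assumes each entry pairs a class name with a folder of .txt files;
the function name load_labelled_documents and the dictionary keys are hypothetical names chosen for
this illustration.

```python
from pathlib import Path


def load_labelled_documents(text_directories):
    """Read every .txt file in each folder and tag it with its class name.

    `text_directories` maps a class name to a folder path, loosely mirroring
    the (class name, directory) entries typed into the Edit List dialog.
    """
    documents = []
    for class_name, directory in text_directories.items():
        # sorted() keeps the file order stable from run to run
        for path in sorted(Path(directory).glob("*.txt")):
            documents.append({
                "label": class_name,
                "source": path.name,
                "text": path.read_text(encoding="utf-8"),
            })
    return documents
```

Calling load_labelled_documents({"poems": "some/folder"}) would return one record per text file in
that folder, each carrying the label "poems"; the folder path here is only a placeholder.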

NOTE: This operator (Process Documents from Files) requires other operators inside it to provide
functionality (e.g., Tokenizers and Filters). If you do not add any other operators to it, the process will not run.
See the examples that follow for exploring text mining.

Tutorial for Text Mining with RapidMiner

For this example we will be using the book “Dracula” by Bram Stoker.

1. To begin, load the book (Dracula.txt) from the text file provided using the “Read Document” operator,
as shown in the previous tutorial on loading a text file into RapidMiner (Fig 1-a and Fig 1-b). Next,
add a “Process Documents” operator to the process as shown in Fig 1-c.

Fig 1-a

Fig 1-b Fig 1-c

2. For this example we will use “Binary Term Occurrences” for the word vector creation, which
can be selected from the drop-down menu in the parameters shown in Fig 1-d (the other options are
TF-IDF, which is the default, Term Frequency, and Term Occurrences). Also, connect the “Read
Document” operator to the “Process Documents” operator, and connect the “Process Documents”
outputs to the res (results) ports, as shown in Fig 1-d.
3. Double-click the “Process Documents” operator to edit its inner process; a new “Vector Creation”
window is displayed as shown in Fig 2-a.


Fig 1-d

Fig. 2-a
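
To make the four vector creation options in step 2 more concrete, here is a small Python sketch
using common textbook definitions. It is an assumption that these match the menu entries in spirit;
RapidMiner's exact formulas (for example, any normalization it applies) may differ. The function
name word_vectors and the scheme labels are hypothetical.

```python
import math
from collections import Counter


def word_vectors(tokenized_docs, scheme="binary"):
    """Build one {term: weight} vector per document.

    scheme: "binary"      -> 1 if the term occurs in the document, else 0
            "occurrences" -> raw count of the term in the document
            "frequency"   -> count divided by the document length
            "tfidf"       -> frequency * log(number of docs / docs containing term)
    """
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    n_docs = len(tokenized_docs)
    # Number of documents in which each term appears at least once
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for empty documents
        row = {}
        for term in vocabulary:
            count = counts[term]
            if scheme == "binary":
                row[term] = 1 if count else 0
            elif scheme == "occurrences":
                row[term] = count
            elif scheme == "frequency":
                row[term] = count / total
            elif scheme == "tfidf":
                row[term] = (count / total) * math.log(n_docs / doc_freq[term])
            else:
                raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(row)
    return vectors
```

With the binary scheme, a word that appears fifty times in a document and a word that appears once
both receive the value 1; only presence or absence is recorded.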

4. Once inside, place the “Transform Cases” (available at Text Processing -> Transformation),
“Tokenize” (Text Processing -> Tokenization), “Filter Stopwords (English),” and “Filter Tokens (by
Length)” operators (the last two are available at Text Processing -> Filtering) and connect them
all as shown in Fig 2-b.

Fig 2-b

5. The “Transform Cases” operator transforms all characters to lowercase by default (this setting
is shown when the “Transform Cases” operator is selected). It can also be configured to
transform them to uppercase. This operator should be used because RapidMiner treats
uppercase and lowercase letters as different, so two occurrences of the same word would be
counted as different words if one begins with a capital letter.
6. The “Tokenize” operator splits the loaded document into “tokens” according to the selected mode.
By default, it splits the document at “non-letter” characters, which is what we will use for this
example. The other modes are outside the scope of this example.
7. The “Filter Stopwords” operator filters out common words in the selected language (English in this
case), such as “the” and “a”.
8. The “Filter Tokens (by Length)” operator filters out tokens that have fewer than a specified minimum
number of characters or more than a specified maximum. For this example, set the minimum to
two (2) and the maximum to 999 as shown in Fig 3. This will filter out any one-letter words such as
“I”.

Fig 3
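
The four operators described in steps 5 to 8 can be sketched in Python as one short function. This
is only an illustration of the idea, not RapidMiner's implementation: the stopword set below is a
tiny made-up sample (the real English list is much longer), and the names preprocess and
SAMPLE_STOPWORDS are hypothetical.

```python
import re

# Tiny illustrative sample; the operator's real English stopword list is far longer.
SAMPLE_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def preprocess(text, min_chars=2, max_chars=999, stopwords=SAMPLE_STOPWORDS):
    """Lowercase, tokenize on non-letters, drop stopwords, filter by length."""
    # Transform Cases: lowercase everything so "The" and "the" become one token
    lowered = text.lower()
    # Tokenize (non letters): keep maximal runs of letters;
    # [^\W\d_] matches word characters that are neither digits nor underscores
    tokens = re.findall(r"[^\W\d_]+", lowered)
    # Filter Stopwords: remove very common words
    tokens = [t for t in tokens if t not in stopwords]
    # Filter Tokens (by Length): keep tokens within the character limits
    return [t for t in tokens if min_chars <= len(t) <= max_chars]
```

For example, preprocess("The Count is a I vampire!") keeps only "count" and "vampire": the
stopwords are dropped first, and the one-letter token "i" is then removed by the length filter.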

9. Press the “PLAY” button to run the process.


10. The “WordList” tab lists all of the words extracted from the file and shows how many times each
word occurred, as shown in Fig 4.
11. The rows can be sorted by clicking a column header. For example, if the content is
numerical, as in the “Total Occurrences” column, clicking the header sorts the rows from
least to most, and clicking it again reverses the order. If the column is
alphanumerical, the rows are sorted alphabetically.

Fig 4
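
The word list and its column sorting can be imitated with a few lines of Python. This is a rough
sketch under the assumption that the word list is simply a table of words and their total counts;
the function names word_list and sort_rows are hypothetical.

```python
from collections import Counter


def word_list(tokens):
    """Count how many times each token occurs (the "Total Occurrences" idea)."""
    return Counter(tokens)


def sort_rows(counts, by="total", descending=False):
    """Return (word, count) rows sorted by count or alphabetically by word."""
    if by == "total":
        # Ties on the count are broken alphabetically so the order is predictable
        key = lambda row: (row[1], row[0])
    elif by == "word":
        key = lambda row: row[0]
    else:
        raise ValueError(f"unknown column: {by}")
    return sorted(counts.items(), key=key, reverse=descending)
```

Sorting by "total" with descending=False corresponds to the first click on the column header
(least to most), and descending=True corresponds to clicking it again.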

12. Now we will add (by double-clicking) the “Generate n-Grams (Terms)” operator (available at Text
Processing -> Transformation) inside the “Process Documents” operator. Then set its “max length”
parameter to two (2) as shown in Fig 5.

Fig 5

13. An n-Gram is a sequence of ‘n’ consecutive terms in a sentence. In the sentence “the quick brown
fox jumps over the lazy dog,” “quick brown” is a 2-Gram, “quick brown fox” is a 3-Gram, and so on.
When generating n-Grams, RapidMiner does not filter out single words, since one
word is a 1-Gram and therefore still an n-Gram.
14. By adding another operator, we can filter out anything shorter than a 2-Gram and so display more
relevant data. This operator is the “Filter Tokens (by Content)” operator, shown in Fig 6.

Fig 6

15. Since RapidMiner joins the terms of an n-Gram with an underscore ( _ ), we set the new operator to keep
anything containing an underscore (see Fig 6), which in this case is any n-Gram with an n of two (2). The
results of this process can be seen in Fig 7.

Fig 7
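
Steps 12 to 15 can be sketched in Python as follows. This is a simplified illustration, assuming
that n-Grams are built from consecutive tokens and joined with an underscore as described above;
the order in which RapidMiner emits the tokens may differ, and the function names generate_ngrams
and filter_contains are hypothetical.

```python
def generate_ngrams(tokens, max_length=2):
    """Return the original tokens (1-Grams) plus all n-Grams up to max_length."""
    result = list(tokens)  # single words are kept, since they are 1-Grams
    for n in range(2, max_length + 1):
        for start in range(len(tokens) - n + 1):
            # Join n consecutive tokens with an underscore, e.g. "quick_brown"
            result.append("_".join(tokens[start:start + n]))
    return result


def filter_contains(tokens, substring="_"):
    """Keep only the tokens that contain the given substring."""
    return [t for t in tokens if substring in t]
```

For the tokens "quick", "brown", "fox" with max_length=2, generate_ngrams returns the three single
words followed by "quick_brown" and "brown_fox", and filter_contains then keeps only those last two.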

16. If we want to generate and see only the 3-Grams in the document, all we have to do is change the “max
length” parameter of the “Generate n-Grams (Terms)” operator to 3 and change the condition of the
“Filter Tokens (by Content)” operator to “Matches.” We then enter a regular expression that matches
any token with two (2) underscores ( _ ) in it. The regular expression we will use is “.*_.*_.*” (without
the quotes). This is shown in Fig 8-a. The symbols (characters) available for building a regular
expression are shown in Fig 8-b.

Fig 8-a

Fig 8-b
17. An example of the results is shown in Fig 8-c.

Fig 8-c
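
The regular expression from step 16 can be tried out in Python. The sketch below assumes that the
“Matches” condition requires the whole token to match the expression, which is why fullmatch is
used; that behavior is an assumption about RapidMiner, not something stated in this tutorial. The
stricter second pattern is a hypothetical alternative added for comparison only.

```python
import re

# The expression used in step 16: at least two underscores anywhere in the token
LOOSE_3GRAM = re.compile(r".*_.*_.*")
# Hypothetical stricter variant: exactly three non-empty parts joined by underscores
STRICT_3GRAM = re.compile(r"[^_]+_[^_]+_[^_]+")


def filter_matches(tokens, pattern):
    """Keep only the tokens that the pattern matches in full."""
    return [t for t in tokens if pattern.fullmatch(t)]
```

With “max length” set to 3, no token can contain more than two underscores, so the looser
expression from the tutorial already selects exactly the 3-Grams. The stricter pattern only makes a
difference if longer n-Grams are generated.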
