
Research Methodology for Educational Data Mining in India

Nidhi Chopra1 and Manohar Lal2
School of Computer and Information Sciences, Indira Gandhi National Open University, Maidan Garhi,
New Delhi – 110068, India, and
ABSTRACT
As the world around us goes through a technological revolution with the dawn of the digital age, we are in some ways compelled to rethink our education system and its components. With the tools and techniques available to us nowadays, it is imperative for us to reconsider how we can use them to improve our education system. Opportunities for knowledge discovery in educational data have increased tremendously compared to the scenario a few years ago. Educational data is becoming increasingly rich as more and more educational systems go online and collect large amounts of data. In this paper we present a study on developing a research methodology for educational data mining. We focus specifically on the Indian education system and compare it with data from another university in the USA. Opportunities that are available to us for drawing effective conclusions using educational data mining are discussed as well.

KEYWORDS
Educational Data Mining, Research Methodology, Indian Education System.

1. INTRODUCTION
In this new digital age, the world of education has also undergone a major transformation. The new technologies and gadgets available help us not only enrich and enhance our existing education system but also offer new opportunities and modes which can take the process of learning beyond institutions and allow people to learn on their own time and on their own terms. These new advances in learning have played a big role in this age of knowledge enhancement via different means and are clearly a sign that we need to rethink how we can tap into the potential of technology to improve our education system [1]. As of now most of the changes can be seen in the way information is distributed or provided to the students, such as e-learning, distance learning, blended learning etc. Another area where we are beginning to see an impact is educational data mining (EDM).

The knowledge revolution has already transformed most work practices and professional jobs. Many jobs which were work intensive before have become more knowledge intensive now. The nature of work has changed from being task oriented to being inferential and abstract. Consider a secretarial job as an example: where in the past it would typically have involved typing documents or memos, it now involves handling interactions with people inside and outside the corporation. People in all fields and disciplines are becoming more and more informed and are learning to observe, collect and interpret data trends around them to make better and informed decisions. Hence, there is no reason the field of education should be any different or left behind [1].

2. EDUCATIONAL DATA MINING
In the last few years EDM has emerged as a field of its own. The EDM community website defines EDM as follows:
“EDM is an emerging discipline, concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students, and the settings which they learn in.” [2]
Data mining (DM), or Knowledge Discovery in Databases (KDD), is the field of discovering novel and potentially useful information from large amounts of data [3]. Largely it consists of analyzing available sets of data to interpret and isolate the trends and patterns present in the data, i.e. converting raw data into information so that it can be used by educators, educational software developers, teachers, parents or students. However, it is largely understood that EDM methods are often different from standard DM methods. This is because of the non-independence and multilevel hierarchy found in educational data [4]. For this reason, it is increasingly common to see psychometric models being used in EDM.

Fig 1: DM as a confluence of multiple disciplines

Figure 1 above shows how DM can be visualized as a confluence of multiple disciplines. In Figure 1 the area of study would be education. The data can be collected from students' use of interactive learning environments, computer-supported collaborative learning, or administrative data from schools and universities.

Copy Right © INDIACom-2012; ISSN 0973-7529; ISBN 978-93-80544-03-8

Understanding the current trends of our education system and society would clearly point towards the underlying issues and help us devise an effective plan to address them.

There are various challenges in the upcoming field of education, like understanding choice of major, student drop out, retention, assessment etc. These problems can be solved by using prediction methods [3] of DM. DM is a field which has originated from databases and Artificial Intelligence (AI). Figure 2 shows the two possible dimensions of EDM.

Fig2: Two possible dimensions of Educational Data Mining

3. EDM FOR EDUCATORS
An excellent example of EDM for educators is a one-of-a-kind educational data repository at LearnLab [5], PSLC (Pittsburgh Science of Learning Centre), CMU (Carnegie Mellon University). They have designed some tools like:
1 DataShop
2 CTAT (Cognitive Tutor Authoring Tools)
3 TagHelper
4 ELI online data
5 TuTalk
A key goal of PSLC is to support learning scientists in providing explanations of results using, as much as possible, the same core terminology and addressing an accumulating body of precise theoretical principles of instruction.

DataShop provides datasets; some of them are public and some others can be accessed on request. These data sets need no cleaning and have no errors or empty fields. They can be opened in Excel, and key support for attributes is included separately. There are freely available research papers also, to show what kinds of studies have already been conducted, like finding patterns and correlations. They have some predesigned methods like:
1 Performance profiler: a multi-purpose report, something akin to an educational researcher's Swiss Army Knife. Following measures can be viewed:
a. Error Rate (%)
b. Assistance Score
c. Average Number of Incorrects
d. Average Number of Hints
e. Residual Error Rate Percentage (Predicted−Actual)
2 Error report: summaries of student performance by step or knowledge component, the actual values students entered and the feedback they received, and at-a-glance information on problem coverage – the number of students exposed to a particular step or knowledge component, the knowledge components associated with each step, and the problems that contain each knowledge component.
3 Learning Curve: visualizes changes in student performance over time.
Other than these, popular tools like WEKA, RapidMiner, SPSS, R and Matlab can also be used for the analysis. These tools have pre-written code or libraries of most DM algorithms [3]. This is a good platform for researchers who excel in MS-Excel or have coding/programming proficiency. However, there are restrictions on the kind of data they can collect or questions that can be asked.

CTAT is a tool for preparing intelligent tutors. It supports the creation of two types of tutors: example-tracing tutors, which can be created without programming but require problem-specific authoring, and cognitive tutors, which require AI programming to build a cognitive model of student problem solving but support tutoring across a range of problems.

TagHelper is a tool for applying machine learning technologies to text processing problems.

ELI online data refers to English LearnLab Data or ELI's Online Data Search System for student data collected through the various course offerings of the ELI.

TuTalk is for authoring and experimenting with natural language dialogue in tutoring systems and learning research.

Example 1: A student has to solve an equation ax + b = c. Now there are steps in solving this equation. The tool asks the student to key in each and every step. These are collected with the help of Cognitive Tutor, a tool used by students in schools and colleges. Student input is stored and the above mentioned analysis can be performed. The patterns found can be used to improve classroom teaching and the tool itself.

Example 2: An interesting problem here could be to conduct a pre-test and a post-test in a mathematics class and see how students learn division and multiplication. Do they realize the exceptional case of division by zero [6] and so on? Do they break down the problem of multiplication into addition steps? This assesses success of students and teachers in a class.
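The Performance Profiler measures named in this section (error rate, assistance score, average incorrects, average hints, and residual error rate as predicted minus actual) can be illustrated with a minimal sketch. The record layout `StepRecord` and the function `profile` are hypothetical and do not reproduce DataShop's export format; treating the error rate as the share of first attempts that were not correct, and the assistance score as incorrect attempts plus hints, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StepRecord:
    """One student's work on one step (hypothetical layout, not DataShop's)."""
    student: str
    step: str
    first_attempt_correct: bool
    incorrects: int
    hints: int
    predicted_error_rate: float  # model prediction, in percent


def profile(records):
    """Aggregate the five profiler-style measures over a list of StepRecord."""
    if not records:
        raise ValueError("no records to profile")
    n = len(records)
    # Assumption: an attempt counts as an error if the first try was not correct.
    error_rate = 100.0 * sum(not r.first_attempt_correct for r in records) / n
    predicted = mean(r.predicted_error_rate for r in records)
    return {
        "error_rate_pct": error_rate,
        # Assumption: assistance = incorrect attempts + hint requests.
        "assistance_score": mean(r.incorrects + r.hints for r in records),
        "avg_incorrects": mean(r.incorrects for r in records),
        "avg_hints": mean(r.hints for r in records),
        # Residual as stated in the text: predicted minus actual.
        "residual_error_rate_pct": predicted - error_rate,
    }


if __name__ == "__main__":
    # Toy data for the two steps of solving ax + b = c (made up for illustration).
    demo = [
        StepRecord("s1", "subtract b", True, 0, 0, 40.0),
        StepRecord("s1", "divide by a", False, 2, 1, 60.0),
    ]
    print(profile(demo))
```

Grouping the same records by `step` or by `student` before calling `profile` would give the per-step or per-student views that a report of this kind typically offers.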

3. EDM FOR ADMINISTRATORS
An example of EDM for Admin is enrolment data collected by SRD (Student Registration Division) of IGNOU (Indira Gandhi National Open University). For this research, enrolment data of disabled students for the entire year 2009 was obtained from them in January 2010. The data is stored using FoxPro and can be easily imported into MS-Excel. After data cleaning some interesting patterns were obtained, telling us what interested these special students. This study and its results will be discussed in later sections. Before that we will discuss developing a research methodology for the analysis of an educational data set.

EDM is a new field. Most research is coming from academics and some commercial or business concerns with a focus on EDM. Currently, EDM is considered an important R&D area by computer science, education, social science, management and many related fields. SPSS is the most popular software in this field. There are only a handful of journals and books in this area. In the field of education, a wide variety of DM methods are available, such as prediction, clustering, relationship mining, discovery with models, and distillation of data for human judgment.

In EDM there is a lot of scope for exploring and identifying the factors affecting students' profiles. Mining of students' characteristics can be useful to both administrator and teacher. The word 'characteristics' here means distinguishing properties or attributes [8]. One such important attribute is 'finance' or family income, i.e. 'parental income', which determines the health and food/nutrition a family can buy and also the influence of other luxuries and exposure to technology and information. For example, very poor families mostly can't afford nutritional food and necessary health care facilities. Students from such families are very likely to suffer from disability or serious ailments. [...] life style can be used to predict health conditions. Another attribute is 'disability', which covers categories like physical, mental or learning disability.

There are disparities in performance of students in a class. We can identify the reasons for wide [...]. How does the environment of a student affect their concentration and their liking/disliking of a particular method or subject? Such answers can be partially predicted on the basis of history & tags/keywords. Understanding problems of students from an educational psychology point of view is important here. Human behavior is complex and difficult to capture. An EDM researcher has to capture patterns in the data. Culture and religion have a great impact on the lives of students, especially in countries or communities with diversity. At some level, all these attributes have to be considered to accurately separate the most influential factors or variables. We can identify the influential factors and contributing factors from this study.

While dealing with enrollment data of students it can be seen that some columns contain sensitive information, like religion, caste and creed. Similar controversial attributes are gender, marital status, and children. EDM research can be conducted in an unbiased manner. Data sets to be used here are so huge that a researcher can't collect them individually. These databases are mostly provided by some educational body.

4. RESEARCH METHODOLOGY
Research methodology is a way to systematically study and solve a research problem [7]. In research methodology, the researcher selects what tools should be selected for the analysis and why, taking into account all the underlying assumptions and all the criteria under consideration. This implies that the design of research methodology might differ from problem to problem. It is necessary for the researcher to know not only the research methods/techniques which are available but also the methodology. In EDM, research can, in many ways, help us in the following areas –
1. Description – Defining or differentiating a phenomenon from others, describing characteristics of a population or its subsets.
2. Prediction – Identifying relationships P → Q. P is the cause of Q and Q happens after (follows) P. So P aids in Q's prediction.
3. Exploration – Exploring hidden causes behind a phenomenon, i.e. examining evidences.
4. Explanation – Comparing two or more theories in an unbiased manner.
5. Action – Finding a solution, applying and verifying it as well. An action based on the suggestive results, and evaluation of its effectiveness, is the best evidence to prove the vitality of such a research. However, the action based on such studies is usually lacking or delayed.
It is also important to tie in the final objective of the research with these fields and understand the impact in each area.

5. EVALUATION OF RESEARCH
Evaluation of research involves focusing on both sides of the coin – positive and negative, good and bad. An EDM research [...] how it is viewed by the world. There are following questions to be answered –
1. Who – Researchers, participants and consumers of the research, e.g. educators and educands. Competences and biases of the researcher play a role. A researcher and the authorities under whom he or she is working should know that research is being done for the betterment of mankind and not for their personal reasons or grudges. Researcher and authorities should themselves remain unbiased towards each other and the students. Researcher should work as per rules laid down by authorities. These authorities should respond to the researcher's request if they can.
2. What – Topic and theory on which the research is based, and methods. Relevance of the research is a key decisive factor.
3. Where – In what kind of environment research can be conducted. [...] free from authoritarianism.
4. When – What is the best possible time to conduct a research. The best time is when the problem and its solution are relevant. A true research in DM is relevant at that time, even if it becomes irrelevant later, when the particular problems and issues are not there.
5. Why – Other than the above, what are other motivational reasons behind doing a research. A research is important for the organizations involved in it. We should try to fit the conclusions to the data and not the other way around, even if that does not agree with the overall picture being created by the organization.
6. How – What are the methods of conducting a research, i.e. research methodology (RM). It depends on the demand of the organization providing data and sponsorship. A researcher must be given some freedom. But still a systematic approach has to be followed.
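The prediction relationship P → Q described under Research Methodology can be made concrete with a small worked sketch. The function `rule_stats` and the attribute names used in the demo are hypothetical; the measures computed (support and confidence) are the usual association-rule counts, swapped in here as one simple way to quantify how often Q accompanies P. They describe co-occurrence in the data and, on their own, do not establish that P causes Q.

```python
def rule_stats(records, p, q):
    """Return (support, confidence) of the rule P -> Q over a list of records.

    p and q are predicates taking one record and returning True or False.
    support    = share of all records where both P and Q hold
    confidence = share of records with P where Q also holds
    """
    n = len(records)
    n_p = sum(1 for r in records if p(r))
    n_pq = sum(1 for r in records if p(r) and q(r))
    support = n_pq / n if n else 0.0
    confidence = n_pq / n_p if n_p else 0.0
    return support, confidence


if __name__ == "__main__":
    # Made-up records; the column names are illustrative assumptions only.
    rows = [
        {"low_income": True, "health_issue": True},
        {"low_income": True, "health_issue": False},
        {"low_income": False, "health_issue": False},
    ]
    s, c = rule_stats(rows, lambda r: r["low_income"], lambda r: r["health_issue"])
    print(f"support={s:.2f} confidence={c:.2f}")
```

A rule with high confidence but very low support rests on few students, which is one reason rules drawn from a single year of data should be generalized cautiously.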

Conditions under which a research is conducted are important. E.g. today there is stress on female education, which was not there in past centuries in India. Education budget has increased, but at the same time globalization has made educational organizations look like markets. There is a lot of competition between different organizations, which is helping shape the current system but can also lead to negative outcomes if not managed carefully.

Other examples of this research are when assessment data, drop out and retention are used to predict the future pathway of a student. Data of salary and perks could also be used for analysis. Similarly, data of leave of students, teachers and other staff can also be used for medical DM, i.e. health issues of students and staff.

6. THE SCIENTIFIC APPROACH
There are no sacred truths. Scientific research is more about observations followed by logical analysis to generalize inductively, i.e. to give a theory which fits the facts (rules here) that have come out of the data. But overgeneralization makes it nonscientific. In this research the rules generated can't be generalized to a high degree because data of only one year is being used. To achieve a high degree of generalization, data from a long duration should be used. This can also help in testing the rules generated. There should not be too many rules as in Fuzzy Systems.

Data is analyzed to state a conclusion. If more than one theory is possible, the best fit theory is the conclusion. The hypothesis can then be proved that such a functional dependency of attributes is there. Then action has to be taken in the form of a pilot study. Tomorrow, after a new research, even this final best fit theory may become invalid. Then it has to be replaced by some other theory. DM research can be conducted using empirical, simulated or real world data.

6. RESULTS
In this section some of the results are presented as obtained from the data analysis conducted on disabled students of IGNOU who enrolled for various courses in the year 2009.

Fig 3: Age graph for the two cycles/semesters
Fig 4: Choice of medium (count pie chart)
Fig5: Pie chart for gender
Fig6: Pass out year (count scatter plot)

1 Most students were in their late 20s.
2 Most of them enrolled for Master of Political Sciences and paid an average fee.
3 Most students opted for English Medium.
4 Most of the students are male.
5 Most students were pass outs of the past 10 years only.
6 Most of them are unemployed.
7 They are from urban areas.
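Results of the "most students ..." kind are frequency summaries: for each attribute, the most common value and how many records carry it. A minimal sketch of that step is shown here, assuming the enrolment table has already been cleaned and loaded as a list of dictionaries. The function `modal_values` and the column names in the demo are hypothetical and are not the fields of the IGNOU data set.

```python
from collections import Counter


def modal_values(rows, columns):
    """For each column, return (most common value, its count, non-empty total).

    Rows are dictionaries; missing or empty values are skipped so that
    blank fields do not show up as the most common "value".
    """
    summary = {}
    for col in columns:
        counts = Counter(
            row[col] for row in rows if row.get(col) not in (None, "")
        )
        if counts:
            value, count = counts.most_common(1)[0]
            summary[col] = (value, count, sum(counts.values()))
    return summary


if __name__ == "__main__":
    # Made-up rows; column names are illustrative assumptions only.
    rows = [
        {"medium": "English", "area": "Urban"},
        {"medium": "English", "area": "Urban"},
        {"medium": "Hindi", "area": ""},
    ]
    for col, (value, count, total) in modal_values(rows, ["medium", "area"]).items():
        print(f"{col}: {value} ({count} of {total})")
```

Reporting the count alongside the total makes it visible how strong each "most" is, which matters when only one year of data is available.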

7. DISCUSSION OF RESULTS
1 Result 1: Figure 3 has the shape of a normal distribution, as often exhibited in biological data [9]. This also resonates well with the model of our current education system, where most people like to study or focus on their career value addition in their twenties or early thirties preferably.
2 Result 2 is due to the fact that these students find it easy to do humanities courses because there is no help in the form of artificial limbs & training to use them in laboratories (sciences). Travelling to various laboratories, even just to visit and have a look, is a problem as our science laboratories have no accessibility areas or equipment. It makes sense to opt for courses which require no or less resources.
3 English medium books are comparatively easily available in India. IGNOU however plans to launch courses in regional languages. More steps that can be taken are to encourage translation of popular books/texts in all subjects to help them, and to make them accessible through Braille translation, audio books (record and release), and video field tours.
4 Result 4 shows that disability is more common in males in this data set.
5 Result 5 is self-evident. Students are in their 20s, so they have mostly passed out in the recent past. Figure 6 has the shape of a chi-square distribution.
6 Result 6 requires action from Governments to create accessible jobs to increase employment.
7 Result 7 indicates poor life style in urban areas. Rethinking education requires looking back to our roots – Ayurveda (diet and medicine), Yoga, Meditation, Sports, Games, Arts and Music of Indian origin. Staying close to nature (farming, gardening) can help cleanse the body and soul of humans. This is the simple living and high thinking created by our ancestors for us.

CONCLUSION
EDM research follows a very simple & straight style like any DM problem. After acquiring data, the data is cleaned. Transformations are made. After converting it into a desirable format, a program or tool can be run on it to find patterns. The second stage is reporting the research as a paper or thesis, which has to be done in a prescribed format and flow. Learning from the research is just one part. Implementation or action based on that makes the research effort worth it.

In the mentioned two categories of EDM it was observed that there is a difference between these two researches. Since at PSLC the tools are being designed by those involved in the research, they have very clean data. They also get to decide & choose the attributes they want. At IGNOU there may be a requirement of including a few more fields – degree of disability, monthly/annual income of the family and other personal data of the family. There are some differences in the kind of political or government hold on the data in the two universities (CMU & IGNOU) in the two democracies. But the system in IGNOU is certainly an evolving one.

FUTURE WORK
The next step in this research is to analyze the assessment data of these students using DM techniques, like tuple analysis. Finding what determines the success of a disabled student in distance education is the next stage of this research. Assessment is not just 10th and 12th, a few degrees or a big project report. The personality and performance of every pupil have to be judged throughout the academics and the employment/career. A sample study could be finding an E-learning model based on AI techniques that fits your data. This research can help us provide useful insights into the Indian education system and help us to understand the factors affecting students and improve on them in the long run.

ACKNOWLEDGEMENTS
Authors thank the Student Registration Division at IGNOU for providing student enrollment data for this study and the team at SOCIS IGNOU for their support. The authors also acknowledge team members & mentors at PSLC Summer School 2011 (Prof. Kenneth R. Koedinger, Erik Zawadzki and Sabestian Lalle') and Dr. Neetu Chopra (Scientist, Plextronics) for their suggestions in this work.

REFERENCES
[1]. Collins A., Halverson R. (2009). Rethinking education in the age of technology: the digital revolution and schooling in America. Teachers College Press.
[2]. www.educationaldatamining. (accessed on 19 December 2011).
[3]. Witten, I.H. & Frank E. (2005). Data Mining – Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers.
[4]. Mitchell T.M. (1997). Machine Learning. McGraw-Hill.

[5]. http://learnlab. (accessed on 17 August, 2011).
[6]. http://learnlab. ... 011/PSLCSummerSchoolPostersAndFirehoseSlides2011.zip (accessed on 17 August, 2011).
[7]. Dane F.C. (1990). Research Methods. Brooks/Cole Publishing Company.
[8]. Jiawei H. & Kamber M. (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann.
[9]. Rao P.S.S., Richard J. (1999). An Introduction to Biostatistics – A Manual for Students in Health Sciences. 3rd ed. PHI Pvt. Ltd.