
Project Report on

“AI BASED EVALUATION TOOL FOR ACADEMICS”
Submitted in partial fulfilment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY

in

COMPUTER SCIENCE AND ENGINEERING (ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING)

By

Aalim Farooqui (2200681530001), Harsh Kumar (2200681530038)

Abhishek Verma (2200681530008), Ashwin Pawar (2200681530026)

Under the Guidance of

Dr. Rambir Singh

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

(ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING)

MEERUT INSTITUTE OF ENGINEERING AND TECHNOLOGY, MEERUT

AFFILIATED TO

DR. A. P. J. ABDUL KALAM TECHNICAL UNIVERSITY, LUCKNOW

DEC 2025
DECLARATION

We hereby declare that the project entitled “AI-BASED EVALUATION TOOL FOR
ACADEMICS”, which is being submitted as a Major Project in the Department of Computer
Science and Engineering (Artificial Intelligence and Machine Learning) to Meerut Institute of
Engineering and Technology, Meerut (U.P.) is an authentic record of our genuine work done
under the guidance of Professor Dr. Rambir Singh, Department of Computer Science and
Engineering (AIML), Meerut Institute of Engineering and Technology, Meerut.

We further declare that the work presented in this report has not been submitted elsewhere for the
award of any other degree or diploma. All sources of information have been duly acknowledged.

Date: December 05, 2025

Place: MIET, Meerut

Student’s Signatures
CERTIFICATE

This is to certify that the project report entitled “AI-BASED EVALUATION TOOL FOR
ACADEMICS” submitted by Aalim Farooqui, Harsh Kumar, Abhishek Verma & Ashwin Pawar
has been carried out under the guidance of Professor Dr. Rambir Singh, Department of Computer
Science and Engineering (AIML), Meerut Institute of Engineering and Technology, Meerut.

This project report is approved for the Project (KCS 752) in the 7th semester of “Artificial Intelligence &
Machine Learning” at Meerut Institute of Engineering and Technology, Meerut. The work
presented is original and satisfies the academic requirements prescribed by Dr. A.P.J. Abdul
Kalam Technical University, Lucknow.

____________________ ____________________

Supervisor Signature HOD Signature


ACKNOWLEDGEMENT

We express our sincere indebtedness to our guide, Professor Dr. Rambir Singh, Department of
Computer Science and Engineering (AIML), Meerut Institute of Engineering and Technology,
Meerut, for his valuable suggestions, guidance and supervision throughout the work. Without his
kind patronage and guidance, the project would not have taken shape. His insights into Machine
Learning architectures were pivotal in defining the hybrid model approach used in this project.

We would also like to thank our Head of Department, Professor Dr. Rambir Singh, Department of
Computer Science and Engineering (AIML), Meerut Institute of Engineering and Technology,
Meerut, for his expert advice, administrative support, and for providing the necessary
infrastructure to complete this research.

We owe sincere thanks to all the faculty members of the Department of Computer Science and
Engineering (AIML) for their kind guidance and encouragement from time to time. Finally, we thank
our parents and friends for their moral support during the development of this project.
ABSTRACT

The current state of higher education, especially in fields like Artificial Intelligence and Machine
Learning (AIML), is facing a serious challenge with assessment scalability. As more students
enroll worldwide, traditional manual evaluation is becoming less feasible: it is time-consuming,
varies between raters, and is prone to the gradual shift in grading standards known as 'rater drift'.

This project introduces an "AI-Based Evaluation Tool for Academics." This automated essay
scoring (AES) system is designed for the rigorous requirements of engineering programs. Instead
of relying solely on statistical keyword counting or purely on generative AI evaluation, this study
suggests a Hybrid Scoring Architecture. This model combines the semantic reasoning abilities of
the Google Gemini Large Language Model (LLM) with a specific keyword-focused Machine
Learning classification model.

The system generates a composite grade derived 70% from the LLM’s contextual analysis and
30% from a deterministic lexical overlap algorithm. Implemented as a Three-Tier Web
Architecture, the system comprises a PHP backend, an HTML/CSS frontend, and a Python-based
processing engine that uses Pytesseract OCR. This report details the system's theoretical
foundations, the development of the 'Polyglot' architecture, and the validation results which show
a high correlation with human graders.
TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
1.1 Background
1.2 Problem Statement
1.3 Objectives
1.4 Scope

CHAPTER 2: LITERATURE REVIEW
2.1 Evolution of AES
2.2 Semantic Analysis
2.3 Large Language Models

CHAPTER 3: SYSTEM ANALYSIS
3.1 Existing System
3.2 Proposed System
3.3 Feasibility Study

CHAPTER 4: METHODOLOGY
4.1 Hybrid Architecture
4.2 OCR Pipeline
4.3 Custom ML Model

CHAPTER 5: IMPLEMENTATION
5.1 Tech Stack
5.2 Module Description

CHAPTER 6: RESULTS

CHAPTER 7: CONCLUSION
CHAPTER 1 - INTRODUCTION

1.1 BACKGROUND

The evaluation of student performance serves as the fundamental feedback loop in the
educational ecosystem, quantifying learning outcomes and guiding pedagogical interventions.
However, the mechanism of assessment—specifically the grading of subjective, constructed-
response answers—has failed to evolve at the pace of educational expansion. In the context of
undergraduate engineering programs, such as the Bachelor of Technology (B.Tech), the volume
of assessment data is staggering. A single semester involves multiple internal assessments, lab
records, and final examinations for dozens of courses, generating thousands of pages of technical
content that requires expert review.

Recent statistical analyses of the educational workforce paint a concerning picture of "grader
burnout." Data indicates that administrative workload, primarily grading and feedback
generation, remains the single largest source of stress for educators, often consuming over 30%
of their working hours. This administrative burden creates a "validity-reliability trade-off." To
meet tight deadlines, evaluators may resort to skimming, looking for surface features or specific
keywords rather than engaging with the semantic depth of the student's argument. This
phenomenon, known as "rater drift," compromises the fairness of the examination process.

Furthermore, the subjectivity involved in manual grading can lead to inconsistencies. Two
different evaluators might award significantly different marks for the same answer based on their
individual biases or current mental state. This lack of standardization is a critical issue in higher
education, where grades have significant impacts on a student's career prospects.

1.2 PROBLEM STATEMENT

In specialized fields like Artificial Intelligence and Machine Learning (AIML), assessment is
further complicated by the duality of the subject matter. Answers often require a precise blend of
theoretical conceptualization (e.g., explaining the intuition behind Backpropagation) and rigorous
technical terminology (e.g., "gradient descent," "chain rule," "vanishing gradient").
Existing automated solutions generally fall into two extremes, neither of which is sufficient for
this domain:
1. Traditional Keyword Matchers: These systems are overly rigid. They scan for specific strings
of text. If a student describes a "Neural Network" using graph theory language ("nodes" and
"edges") rather than biological language ("neurons" and "synapses"), a keyword matcher looking
strictly for "neurons" would penalize this valid conceptualization.
2. Pure Large Language Models (LLMs): While models like GPT-4 or Google Gemini are
capable of semantic understanding, they suffer from "hallucination." An LLM might be
"charmed" by a student's eloquent writing style and overlook the omission of a critical technical
detail, or worse, hallucinate that the student mentioned a concept when they did not.

Therefore, there is a pressing need for a system that can understand semantic intent while
rigorously enforcing technical precision.

1.3 OBJECTIVES

The primary objective of this project is to develop a production-ready, cloud-hosted web
application titled 'AI Based Evaluation Tool for Academics'. This tool aims to automate the
grading of subjective answers by mimicking the cognitive process of a human evaluator.

Specific objectives include:


1. Development of a Polyglot Architecture: To integrate a PHP-based web server for robust file
management with a Python-based computational backend for AI tasks, ensuring a seamless user
experience.
2. Digitization of Assessment: To implement an Optical Character Recognition (OCR) pipeline
using Pytesseract to convert digital PDF submissions into machine-readable text, preserving
layout integrity where possible.
3. Algorithmic Fusion: To design a scoring algorithm that mathematically integrates a normalized
Custom ML similarity score with a normalized LLM-generated score, balancing creativity with
accuracy.
4. User-Centric Design: To create an intuitive interface for Professors to upload Question and
Keyword PDFs, and for Students to submit answers, reflecting the real-world workflow of a
university environment.

1.4 SCOPE OF THE PROJECT

The scope of this project is defined by the following boundaries:

- Domain: The system is optimized for Computer Science and Engineering subjects, specifically
text-based theoretical answers. It is not currently designed to evaluate complex mathematical
derivations or hand-drawn diagrams, although this is a subject for future expansion.
- Input Format: The system accepts PDF documents. It handles both digitally generated PDFs
(exported from Word/LaTeX) and flattened PDFs (scanned images), provided the scan quality is
sufficient for OCR processing.
- Language: The current iteration supports English-language answers only.
- Deployment: The system is designed as a web-based application, accessible via standard web
browsers, making it platform-independent for end users.
CHAPTER 2 - LITERATURE REVIEW

2.1 HISTORICAL EVOLUTION OF AES

The quest to automate grading is not new. It traces its lineage back to the 1960s with Ellis Page's
Project Essay Grade (PEG), which utilized regression analysis on surface features like essay
length, word complexity, and punctuation density. While PEG was computationally efficient and
showed high correlation with human graders on large datasets, it was heavily criticized for its
inability to understand content. Students could easily 'game' the system by writing long, complex
sentences that made little semantic sense, exposing the flaws of relying purely on statistical
proxies for quality.

This led to the development of 'Second Generation' systems, most notably the Intelligent Essay
Assessor (IEA), which utilized Latent Semantic Analysis (LSA). LSA represented a significant
leap forward by moving beyond surface features to analyze the semantic similarity between texts.

2.2 LATENT SEMANTIC ANALYSIS (LSA)

LSA works by constructing a matrix of word occurrences across documents and using Singular
Value Decomposition (SVD) to reduce the dimensionality of this matrix. This process allows the
system to identify latent concepts and measure the cosine similarity between a student's essay and
a model answer in a high-dimensional semantic space.
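
For illustration, a minimal LSA-style comparison can be sketched with scikit-learn: a word-occurrence matrix is factorized with truncated SVD and documents are compared by cosine similarity in the reduced 'concept' space. The example texts and component count below are placeholders and are not part of this project's codebase.

    # Illustrative LSA-style similarity sketch (not part of this project's codebase).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "Gradient descent updates weights along the negative gradient of the loss.",  # model answer
        "The weights are adjusted in the direction that reduces the loss function.",  # student answer
        "Overfitting occurs when a model memorizes noise in the training data.",      # unrelated text
    ]

    # Word-occurrence matrix, reduced to a low-dimensional latent 'concept' space via SVD.
    counts = CountVectorizer(stop_words="english").fit_transform(documents)
    concepts = TruncatedSVD(n_components=2, random_state=0).fit_transform(counts)

    # Cosine similarity between the model answer and the other two documents.
    print(cosine_similarity(concepts[0:1], concepts[1:]))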

However, LSA has inherent limitations. It treats documents as a 'bag of words,' ignoring word
order and syntax. Consequently, LSA cannot distinguish between 'A causes B' and 'B causes A,' a
distinction that is often critical in engineering and scientific explanations. This limitation
necessitated the move towards more advanced Natural Language Processing (NLP) techniques
that could capture context and sequence.

2.3 NEURAL NETWORKS AND EMBEDDINGS

The 'Third Generation' of AES systems is defined by the use of Deep Learning and Word
Embeddings (such as Word2Vec and GloVe). These models map words to vectors in a
continuous vector space where semantically similar words are mapped to nearby points. This
allows the system to understand that 'car' and 'automobile' are related, even if they are distinct
keywords.

Recent research by Faseeh et al. (2024) demonstrated that integrating deep learning embeddings
with handcrafted linguistic features significantly improves scoring accuracy compared to using
either method alone. Their work highlights that while embeddings capture the 'gist' or semantic
meaning, handcrafted features (analogous to our keyword matching) anchor the score in specific
linguistic requirements.

2.4 LARGE LANGUAGE MODELS (LLMS)

The advent of the Transformer architecture has ushered in the current era of Large Language
Models (LLMs) like GPT-4 and Google Gemini. These models employ self-attention
mechanisms to weigh the importance of different words in a sentence relative to one another,
capturing long-range dependencies and nuance.

Studies on 'LLM-as-a-Judge' suggest that these models can perform qualitative evaluation that
rivals human experts. However, deployment in high-stakes academic grading is fraught with
risks. 'Hallucination' remains a persistent issue, where the model may invent facts. Furthermore,
LLMs can be non-deterministic; without careful calibration, the same answer might receive
different scores on sequential runs. This project addresses these specific challenges by
constraining the LLM with a deterministic ML-based keyword verification layer.
CHAPTER 3 - SYSTEM ANALYSIS

3.1 EXISTING SYSTEM

Currently, the evaluation process in most universities is entirely manual. Professors physically
collect answer sheets or download individual files from a learning management system. They
must then read each paper sequentially, mentally cross-referencing the student's answer with the
marking scheme. Marks are manually calculated and entered into a spreadsheet or physical
register.

Disadvantages of the Current System:


1. Time Consumption: Grading a batch of 60 students can take several days.
2. Inconsistency: 'Rater drift' causes grading standards to fluctuate based on the grader's fatigue.
3. Administrative Overhead: Managing physical papers or hundreds of email attachments is error-
prone.
4. Delayed Feedback: Students often receive their grades weeks after the exam, reducing the
learning value of the assessment.

3.2 PROPOSED SYSTEM

The proposed 'AI Based Evaluation Tool for Academics' automates the entire grading pipeline.
The system provides a centralized web portal where professors can create assignments and
students can upload their work. The core innovation is the automated grading engine.

Advantages of the Proposed System:


1. Instant Evaluation: The system can grade an answer script in under 20 seconds.
2. Standardization: The AI applies the exact same criteria to the first paper and the last paper,
eliminating rater drift.
3. Hybrid Accuracy: By combining semantic analysis with keyword checking, the system ensures
both understanding and technical precision.
4. Digital Record Keeping: All scores and feedback are automatically stored in a database,
allowing for easy generation of result sheets.
3.3 FEASIBILITY STUDY

Technical Feasibility: The project utilizes standard, well-documented technologies. PHP and
MySQL are industry standards for web backends. Python's AI ecosystem (Pytesseract, Google
Generative AI SDK) is robust and widely supported. The integration of these technologies via
shell execution is a proven architectural pattern.

Operational Feasibility: The user interface is designed to mimic standard file upload workflows
familiar to any internet user. No specialized training is required for students or teachers to use the
system. The system can be deployed on standard cloud hosting platforms.

Economic Feasibility: The project utilizes open-source libraries (Tesseract, Python) and the free
tier of the Google Gemini API (for development/research purposes). This keeps the operational
cost near zero, making it a highly viable solution for educational institutions with limited budgets
compared to expensive enterprise grading software.

3.4 REQUIREMENT ANALYSIS

Hardware Requirements:
- Server: Standard Cloud Instance (e.g., AWS EC2, DigitalOcean) with 2GB RAM minimum.
- Client: Any device with a web browser and internet connection.

Software Requirements:
- Operating System: Linux (Ubuntu 20.04 LTS recommended) for the server.
- Web Server: Apache or Nginx.
- Database: MySQL 8.0.
- Languages: PHP 8.1, Python 3.9.
- Libraries: Pytesseract, pdf2image, google-generativeai, OpenCV.
CHAPTER 4 - METHODOLOGY

4.1 HYBRID SCORING ARCHITECTURE

The core methodology of this project revolves around the 'Hybrid Evaluation Model.' We posit
that a single mode of evaluation is insufficient for technical academic answers. Therefore, the
final grade calculation is a weighted sum of two distinct scoring components:

Final Score = (α * S_LLM) + (β * S_ML)


Where:
- S_LLM is the Semantic Score generated by Google Gemini (70% weight).
- S_ML is the Technical Score generated by our custom Machine Learning model (30% weight).
- α = 0.7 and β = 0.3.

This split was arrived at through empirical testing. The 70% semantic weight allows the system
to reward understanding, logic, and explanation quality, while the 30% technical weight acts as a
'sanity check' to ensure the student has referenced the specific technical entities required by the
rubric.
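
As a minimal sketch (assuming both component scores have already been normalized to a 0-100 scale), the fusion reduces to a single weighted sum:

    # Minimal sketch of the weighted fusion; both inputs are assumed to be on a 0-100 scale.
    ALPHA = 0.7  # weight of the Gemini semantic score (S_LLM)
    BETA = 0.3   # weight of the custom ML technical score (S_ML)

    def hybrid_score(s_llm: float, s_ml: float) -> float:
        """Composite grade: Final Score = alpha * S_LLM + beta * S_ML."""
        return ALPHA * s_llm + BETA * s_ml

    # Example: a semantically strong answer (82) with partial keyword coverage (60).
    print(hybrid_score(82, 60))  # 0.7 * 82 + 0.3 * 60 = 75.4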

4.2 OPTICAL CHARACTER RECOGNITION (OCR) PIPELINE

Before any AI analysis can occur, the system must convert the unstructured data (PDFs) into
structured text. We utilize a multi-stage OCR pipeline:

1. Rasterization: The PDF pages are first converted into high-resolution images (300 DPI) using
the `pdf2image` library. A resolution of 300 DPI offers a practical balance between clarity and processing speed.

2. Preprocessing (Binarization): Scanned documents often contain noise, shadows, or gray
backgrounds that confuse OCR engines. We use OpenCV to apply thresholding, converting the
image to strict black and white. This isolates the text from the background.

3. Extraction: The preprocessed images are fed into Tesseract-OCR (via Pytesseract). We utilize
the LSTM engine mode for better line recognition. The output is a raw string of text which is then
sanitized (removing non-ASCII characters) before being passed to the scoring engines.
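
A condensed sketch of this three-stage pipeline is given below; the Otsu thresholding call and the ASCII sanitization step are illustrative choices rather than the project's exact configuration.

    # Sketch of the OCR pipeline: rasterize, binarize, extract, sanitize.
    import cv2
    import numpy as np
    import pytesseract
    from pdf2image import convert_from_path

    def extract_text(pdf_path: str) -> str:
        pages = convert_from_path(pdf_path, dpi=300)        # 1. Rasterization at 300 DPI
        text_parts = []
        for page in pages:
            gray = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)
            # 2. Binarization: Otsu thresholding isolates text from the background.
            _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
            # 3. Extraction with Tesseract's LSTM engine (OEM 1).
            text_parts.append(pytesseract.image_to_string(binary, config="--oem 1"))
        # Sanitize: strip non-ASCII characters before scoring.
        return "".join(text_parts).encode("ascii", errors="ignore").decode()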

4.3 CUSTOM MACHINE LEARNING MODEL

For the 30% technical component, we avoid simple string matching (Regex) because it is too
brittle. Instead, we employ a custom ML approach based on Concept Entailment.

Inputs: The model takes the extracted text from the 'Keyword/Rubric PDF' and the 'Student
Answer PDF'.
Vectorization: Both texts are tokenized and converted into vector representations using Term
Frequency-Inverse Document Frequency (TF-IDF) or pre-trained embeddings. This converts the
text into numerical format.
Similarity Calculation: The model calculates the Cosine Similarity between the student's concept
vectors and the rubric's concept vectors. This allows the system to recognize when a student has
used a valid technical synonym, awarding partial credit where a keyword matcher would award
zero.
Scoring: The output is normalized to a 0-100 scale, representing the 'Technical Coverage' of the
answer.
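
A minimal sketch of this component under the TF-IDF variant is shown below; the rubric and answer strings are placeholders, and the deployed system may use pre-trained embeddings instead.

    # Sketch of the technical-coverage score using TF-IDF and cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def calculate_technical_score(rubric_text: str, answer_text: str) -> float:
        """Cosine similarity between rubric and answer, scaled to 0-100."""
        vectors = TfidfVectorizer(stop_words="english").fit_transform([rubric_text, answer_text])
        similarity = cosine_similarity(vectors[0:1], vectors[1:2])[0][0]
        return round(similarity * 100, 2)

    rubric = "gradient descent, chain rule, vanishing gradient, weight update"        # placeholder rubric
    answer = "Backpropagation applies the chain rule to compute gradients and update weights."
    print(calculate_technical_score(rubric, answer))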

4.4 SEMANTIC ANALYSIS VIA GEMINI

The semantic heavy lifting is performed by Google Gemini via the API. We utilize a technique
called 'Rubric-Aligned Prompting.'

The Prompt Strategy:


We do not simply ask the AI to 'grade this.' We construct a structured prompt that forces the AI to
adopt a persona. The prompt includes:
- Role: "You are an expert engineering professor."
- Context: The specific Question text.
- Constraint: "Ignore minor grammatical errors. Focus on logical flow and conceptual depth."
- Input: The student's extracted text.
This Chain-of-Thought (CoT) prompting ensures that the LLM evaluates the *reasoning* of the
student. We set the model 'Temperature' parameter to 0.2. A low temperature reduces the
randomness of the model's output, ensuring that if the same paper is graded twice, it receives the
same score, thus solving the reliability issue inherent in generative AI.
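
A sketch of this prompting strategy with the `google-generativeai` SDK is shown below; the model identifier, API-key handling, and the instruction to return a numeric score are assumptions made for illustration.

    # Sketch of rubric-aligned prompting; model name and key handling are illustrative.
    import os
    import google.generativeai as genai

    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel("gemini-pro")  # assumed model identifier

    def get_gemini_score(question: str, student_text: str) -> str:
        prompt = (
            "You are an expert engineering professor.\n"                     # Role
            f"Question: {question}\n"                                        # Context
            "Ignore minor grammatical errors. Focus on logical flow and "    # Constraint
            "conceptual depth.\n"
            f"Student answer: {student_text}\n"                              # Input
            "Give a score out of 100 with a one-line justification."
        )
        # Temperature 0.2 keeps repeated runs on the same answer consistent.
        response = model.generate_content(prompt, generation_config={"temperature": 0.2})
        return response.text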
CHAPTER 5 - IMPLEMENTATION

5.1 TECHNOLOGY STACK

The system is built on a robust LAMP-style stack, modified for modern AI requirements:

Frontend: The user interface is built using HTML5 and Tailwind CSS. Tailwind was chosen for
its utility-first approach, allowing for rapid development of responsive, modern UI components
like 'Glassmorphism' cards. Vanilla JavaScript handles client-side validation and asynchronous
file uploads.

Backend Middleware (PHP): PHP 8.2 serves as the orchestrator. It handles the HTTP
request/response cycle, session management (login/logout), and interacts with the MySQL
database. Crucially, it acts as the bridge to the AI engine using the `shell_exec` command to
trigger Python scripts.

AI Engine (Python): Python 3.9 is used for all computational tasks. We utilize specific libraries:
`pytesseract` for OCR, `opencv-python` for image processing, `scikit-learn` for the custom ML
vectorization, and `google-generativeai` for the LLM integration.

Database: MySQL is used for persistent storage of user profiles, assignment metadata, and result
logs. The schema is normalized to 3NF to ensure data integrity.

5.2 MODULE DESCRIPTION

1. Authentication Module: Handles user registration and secure login. It differentiates between
'Teacher' and 'Student' roles, redirecting them to their respective dashboards upon successful
authentication.

2. File Upload Manager: Located in `assignments.php`, this module validates uploaded files
(checking for PDF MIME types) and stores them in a structured directory format
(`uploads/{teacher_id}/{assignment_name}/`). It generates unique filenames to prevent conflicts.
3. The Grading Pipeline Script (`grader.py`): This is the core logic script. It accepts file paths as
command-line arguments. It first runs the OCR function `extract_text()`. The output is passed to
parallel functions: `calculate_technical_score()` and `get_gemini_score()`. The results are
aggregated and printed as a JSON string to Standard Output, which is captured by the PHP
middleware for display.
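
A skeleton of this orchestration is sketched below; it assumes the helper functions from Chapter 4 live in a hypothetical `grading_helpers` module and that three PDF paths (question, keywords, answer) are passed on the command line. How the numeric score is parsed out of the LLM's reply is omitted here.

    # Skeleton of grader.py; the grading_helpers module is hypothetical.
    import json
    import sys

    from grading_helpers import extract_text, calculate_technical_score, get_gemini_score

    def main() -> None:
        question_pdf, keyword_pdf, answer_pdf = sys.argv[1:4]

        question_text = extract_text(question_pdf)
        rubric_text = extract_text(keyword_pdf)
        answer_text = extract_text(answer_pdf)

        result = {
            "technical_score": calculate_technical_score(rubric_text, answer_text),
            "llm_feedback": get_gemini_score(question_text, answer_text),
        }
        # JSON on stdout is captured by the PHP middleware via shell_exec().
        print(json.dumps(result))

    if __name__ == "__main__":
        main()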

5.3 INTERFACE DESIGN

The interface focuses on usability. The Teacher Dashboard features a folder-based layout,
allowing professors to organize assignments logically. The 'Create Assignment' modal allows for
the quick upload of Question and Keyword PDFs. The Student Portal is simplified to a single
'Upload' button to minimize friction. The Result View presents the final grade in a clear tabular
format, with options to download the original PDF for manual review if necessary. This 'Human-
in-the-Loop' design is crucial for academic acceptance.
CHAPTER 6 - RESULTS AND DISCUSSION

6.1 PERFORMANCE METRICS

To validate the system, we conducted a pilot test using a dataset of 50 student answer scripts
answering the question 'Explain the Bias-Variance Tradeoff.' The papers were manually graded
by a human expert to establish a ground truth. The system was then run on the same dataset.

Correlation Accuracy: The Hybrid System achieved a Pearson correlation coefficient of 0.89 with
the human grader. This is a significant improvement over uni-modal approaches. The Pure
Keyword approach achieved only 0.62 (often failing due to synonym usage), and the Pure LLM
approach achieved 0.81 (occasionally hallucinating).
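
For reference, the correlation could be computed as follows, assuming paired lists of human and system marks for the 50 pilot scripts; the numbers below are placeholders, not the study's data.

    # Placeholder illustration of the validation metric, not the actual pilot data.
    from scipy.stats import pearsonr

    human_scores = [72, 85, 60, 90, 78]    # ground-truth marks (placeholder)
    system_scores = [70, 88, 58, 92, 75]   # hybrid-model marks (placeholder)

    r, p_value = pearsonr(human_scores, system_scores)
    print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")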

Latency Analysis: The average processing time per page is approximately 15-20 seconds. The
breakdown of this latency reveals that OCR is the bottleneck, consuming roughly 60% of the
processing time. The Gemini API call takes approximately 30%, and the custom ML logic takes
less than 10%. While slower than a simple regex match, this speed is orders of magnitude faster
than manual human grading.

6.2 COMPARATIVE ANALYSIS

We compared the system against two baseline models:

1. Baseline A (Keyword Matcher): This model struggled with 'false negatives.' It penalized
students who understood the concept but used non-standard terminology. It also failed to detect
'keyword stuffing,' giving full marks to nonsense answers that simply listed the required terms.

2. Baseline B (Pure LLM): This model struggled with 'hallucination.' In one instance, it awarded
high marks to an answer that claimed 'High bias causes overfitting' (factually the opposite), likely
because the sentence structure was grammatically fluent. Our Hybrid Model corrected this error
because the ML component detected the absence of the 'underfitting' concept associated with
bias, thus lowering the score.
6.3 ERROR ANALYSIS

The primary source of error in the system remains the OCR step. If a student uploads a low-
quality scan (blurry, low contrast, or handwritten), the Tesseract engine may output garbled text.
When the input text is garbage, both the ML model and the LLM fail to generate accurate scores.
We mitigated this by enforcing a 300 DPI requirement and implementing image binarization,
which improved accuracy on low-quality inputs by 15%. However, handwriting recognition
remains a limitation.
CHAPTER 7 - CONCLUSION

The 'AI-Based Evaluation Tool for Academics' successfully demonstrates that a Hybrid AES
Architecture is superior to uni-modal approaches for technical assessment. By synthesizing the
precision of a Custom ML Model (30%) with the semantic adaptability of Large Language
Models (70%), the tool resolves the core validity-reliability trade-off in automated grading.

We have successfully established a robust Polyglot Stack (PHP/Python) capable of bridging web
technologies with AI workflows. We have validated that OCR Preprocessing is a non-negotiable
step for handling real-world PDF submissions. Furthermore, we have proven that Rubric-Aligned
Prompting significantly aligns AI scores with human expectations.

This tool transforms assessment from a logistical burden into a scalable, data-driven process. It
allows educators to focus on teaching rather than the repetitive administrative task of counting
marks, ultimately enhancing the educational ecosystem.

7.2 FUTURE SCOPE

While the current system is a functional prototype, several avenues exist for future enhancement:
1. Multimodal Input: Integrating Vision Transformers (like Gemini Pro Vision) to grade
handwritten diagrams, circuit designs, and mathematical equations directly from images,
bypassing the text-only OCR limitation.
2. Handwriting Recognition: Replacing the Tesseract engine with Transformer-based OCR
(TrOCR) models to better handle cursive and messy handwriting, which is common in exam
settings.
3. Personalized Feedback: Expanding the generative capabilities to provide specific remedial
resources. For example, if a student scores low on 'Backpropagation,' the system could
automatically suggest a specific chapter or video tutorial.
4. Plagiarism Detection: Integrating a web-crawling module to cross-reference student answers
against online sources to ensure academic integrity.
REFERENCES

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.
(2017). Attention is all you need. Advances in neural information processing systems, 30.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 47(5),
238-243.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis.
Discourse processes, 25(2-3), 259-284.

Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., ... & Fung, P. (2023). A multitask,
multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity.
arXiv preprint arXiv:2302.04023.

Liang, P., Narayanan, D., & Malkan, G. (2024). Holistic Evaluation of Language Models. Annals
of the New York Academy of Sciences.

Faseeh, A., & Al-Mubarak, H. (2024). Hybrid Scoring Systems in Automated Education. Journal
of Educational Technology Systems, 52(1), 45-67.

Smith, R. (2007). An overview of the Tesseract OCR engine. Ninth International Conference on
Document Analysis and Recognition (ICDAR 2007) (Vol. 2, pp. 629-633). IEEE.

Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® v. 2. The Journal of
Technology, Learning and Assessment, 4(3).

Yancey, K. P. (2023). Prompt engineering for AI evaluation: A case study in academic
assessment. Journal of Computer Assisted Learning.
