Essay: Maximize Student Potential Through Watson at RIT

  • Published: 1 April 2019*
  • Last Modified: 23 July 2024

\documentclass{sig-alternate}

\pdfpagewidth=8.5truein

\pdfpageheight=11truein

\usepackage{url}

\usepackage{listings}

\lstset{
  frame=single,
  breaklines=true,
  postbreak=\raisebox{0ex}[0ex][0ex]{\ensuremath{\hookrightarrow\space}}
}

\begin{document}

%

% — Author Metadata here —

%\conferenceinfo{SAC’15}{April 13-17, 2015, Salamanca, Spain.}

%\CopyrightYear{2015} % Allows default copyright year (2002) to be over-ridden – IF NEED BE.

%\crdata{978-1-4503-3196-8/15/04}  % Allows default copyright data (X-XXXXX-XX-X/XX/XX) to be over-ridden.

% — End of Author Metadata —

\title{Information Gain From Rochester Institute of Technology}

\subtitle{Watson Project}


\numberofauthors{5} % five authors in total, all listed on the first page

\author{
% 1st. author
\alignauthor Shreyas Sureja\\
\affaddr{Rochester Institute of Technology}\\
\email{srs1521@rit.edu}
% 2nd. author
\alignauthor Nathaniel Cotton\\
\affaddr{Rochester Institute of Technology}\\
\email{nec2887@rit.edu}
% 3rd. author
\alignauthor Phillip Lopez\\
\affaddr{Rochester Institute of Technology}\\
\email{pgl5711@rit.edu}
\and  % start a second row of author names
% 4th. author
\alignauthor Ankit Bhankharia\\
\affaddr{Rochester Institute of Technology}\\
\email{atb5880@rit.edu}
% 5th. author
\alignauthor Uday Wadhone\\
\affaddr{Rochester Institute of Technology}\\
\email{uw1919@rit.edu}
}


\date{}


\maketitle

\begin{abstract}

\end{abstract}


\keywords{IBM Watson, Cognitive Computing, Rochester Institute of Technology}

\section{Introduction}

\label{intro}

Many universities spend thousands of dollars to persuade students to enroll \cite{school_ad}. A great deal of this money is spent on convincing students that this is the university for them. However, each student is unique and wants different things from a university. To accommodate the interests of each student, university recruiters need to be well versed in what the university has to offer and stay up to date on the university's programs. This requires the university to hire people and fund recruitment programs.

This study explores using IBM Watson to ingest a university's publicly available content, producing a system in which prospective students can ask Watson questions \cite{watson} about the school. By publicly available, this study means content that is posted on the university's public website. This addresses two parts of the problem stated above. First, a well-trained instance of Watson may be able to compete with a trained recruiter at answering student questions. If so, fewer recruiters would need to be hired to support recruitment programs, which saves the university money. Second, many large universities are constantly changing. In those cases Watson can ingest the new information, receive additional training, and be prepared to answer students' questions.

It is important to note that this technology would not be limited to simply recruiting, but could also be informative for current and former students.  In most cases students at a university are not well aware of all aspects of what the university has to offer.

\section{Project Plan}

\label{plan}

As mentioned in Section~\ref{intro}, Watson needs to ingest large quantities of data in order to provide expertise on the domain. However, Watson is unable to learn directly from the data; it needs to be trained by an expert who can determine the correctness of a response. For this study the initial portion of the project can be broken down into three distinct stages: data collection (Section~\ref{collect}), data preparation (Section~\ref{prep}), and question generation (Section~\ref{questions}).

\subsection{Data Collection}

\label{collect}

The data collection process involves pulling HTML documents, or documents that are served as HTML (for example, PHP pages). The wget utility can be pointed at a particular web domain and will pull these documents from the specified site \cite{wget}. A number of flags can be provided to this utility to identify what should be collected (only HTML files) and how the data should be collected. For this project the wget command is formatted as follows:

\begin{lstlisting}
wget -r -p -e robots=off --no-check-certificate -U mozilla http://cs.rit.edu
\end{lstlisting}

Some of RIT's subdomains have robots.txt files indicating that crawlers should not pull portions of the site, so a plain wget invocation will not suffice. Setting robots=off tells wget to ignore those rules, and setting the user agent to mozilla allows the command to pull information from these sites without being blocked by their configurations. Once pulled, all of the files exist within one directory, with each file representing a single HTTP request to RIT's web servers \cite{rit_site}.

\subsection{Data Preparation}

\label{prep}

Once the wget utility has run, it leaves a directory filled with the files pulled from the site, and two steps need to be taken. First, information that is not usable, or at least not easily usable, by Watson needs to be stripped from the documents \cite{watson}. In general, the useless information consists of image tags and script tags. Other components of the files, while potentially useless, remain because it is difficult to determine what may or may not be useful for answering questions; it is best to allow Watson to determine that for itself. Second, the version of Watson made available for this project does not allow mass deletes from its corpus, so uploading an entire site, which generally consists of several thousand documents, would be cumbersome to manage through Watson's interface. The content of each individual document will therefore be combined into a single document, which will then be uploaded to Watson for parsing.
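
The two preparation steps described above can be sketched in Python as follows. This is a minimal illustration and not the script used in the project: the names \texttt{TagStripper}, \texttt{strip\_tags}, and \texttt{combine\_site}, the list of file suffixes, and the choice of Python's standard \texttt{html.parser} module are all assumptions made for the example.

```python
from html.parser import HTMLParser
from pathlib import Path


class TagStripper(HTMLParser):
    """Re-emit HTML while dropping <script> elements and <img> tags.

    Hypothetical helper for illustration; not part of the original project.
    """

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif tag != "img":
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        if tag not in ("script", "img"):
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif tag != "img":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text inside a <script> element is discarded with the element.
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_tags(html: str) -> str:
    """Return html without script elements and image tags."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.parts)


def combine_site(root: Path, output: Path,
                 suffixes=(".html", ".htm", ".php")) -> int:
    """Clean every matching file under root and write all of them to output.

    The suffix list is an assumption about which downloaded files hold HTML.
    Returns the number of files that were combined.
    """
    count = 0
    with output.open("w", encoding="utf-8") as out:
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix.lower() not in suffixes:
                continue
            if path.resolve() == output.resolve():
                continue  # never read the file that is being written
            text = path.read_text(encoding="utf-8", errors="replace")
            out.write(strip_tags(text))
            out.write("\n")
            count += 1
    return count
```

Assuming the wget output sits in a directory named after the host, a call such as \texttt{combine\_site(Path("cs.rit.edu"), Path("combined.html"))} would produce the single document to upload. Everything other than script and image tags is left in place, matching the decision to let Watson judge what is useful.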

\subsection{Question Generation}

\label{questions}

As mentioned earlier, Watson by itself is unable to extract accurate meaning from the documents it is provided with, so it needs to be trained on how to handle particular questions. This training process will involve inputting approximately 400 questions and answers to provide Watson with training data. These questions will be generated by hand: each project team member will be given a different portion of the site to generate questions for. Once a question is added to Watson, the team member will search through the answers that Watson provides to determine whether one is suitable. If none of the Watson-provided answers is suitable, the next step is to manually select the portion of text that answers the question. Finally, if the documents cannot produce a reasonable answer, there are two options. One is to remove the question from the database, deeming it an inadequate question to ask. The other is to write an answer in a new document, add it to the corpus, and select that as the answer to the question. The process follows Figure~\ref{fig:question_generation}.

After this training process the instance of Watson should be able to draw the correct associations between the corpus and the questions being asked. However, if Watson is unable to produce adequate results, the data collection and data preparation stages will need to be revisited to put the information in a form that Watson can more easily understand. This process will continue until adequate results are produced.

It is important to note that while Watson is a powerful tool, there is always a chance that it cannot become an expert in a particular domain. If a sufficient number of training attempts fail to produce adequate results, the conclusion may be that Watson cannot handle this particular domain.
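
The per-question review described in this subsection can be summarized as a small decision function. The sketch below is only an illustration of that flow: the three boolean inputs stand for judgments made by a team member, and the names \texttt{Outcome} and \texttt{review\_question} are hypothetical rather than part of Watson's tooling.

```python
from enum import Enum


class Outcome(Enum):
    """Possible results of reviewing one training question (illustrative)."""
    WATSON_ANSWER = "select one of the answers Watson returned"
    MANUAL_SELECTION = "manually select the answering text from the corpus"
    NEW_DOCUMENT = "write an answer in a new document and add it to the corpus"
    DROP_QUESTION = "remove the question from the database"


def review_question(watson_answer_suitable: bool,
                    corpus_contains_answer: bool,
                    question_worth_keeping: bool) -> Outcome:
    """Mirror the manual review order: Watson's answers first, then the corpus.

    Each argument is a human judgment, not something computed by Watson.
    """
    if watson_answer_suitable:
        return Outcome.WATSON_ANSWER
    if corpus_contains_answer:
        return Outcome.MANUAL_SELECTION
    if question_worth_keeping:
        return Outcome.NEW_DOCUMENT
    return Outcome.DROP_QUESTION
```

A question is dropped only when neither Watson's answers nor the corpus can answer it and the reviewer decides that it is not an adequate question to ask.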

\section{Results}

\label{results}

We were able to successfully train Watson to answer simple questions about RIT and the Computer Science Department based on website data we provided. In the test deployment of Watson, we were able to ask and receive correct answers to most of our test questions.

We tested Watson on all of the expert question and answer pairs used to train it, and measured how accurately they were answered. We, as domain knowledge experts on RIT, went back and validated all questions. Questions that received a clearly correct response counted toward our accuracy. Answers that provided the wrong information or no information counted against the accuracy. Questions that may have had the correct answer, but returned a large response in which the correct answer was not within the first few sentences, were also considered wrong and counted against the accuracy.

Our instance answered with an accuracy of 83\% on the training questions: it answered 402 questions correctly while missing 82. The instance was also subjected to 50 new questions designed on the same subject matter that Watson was trained on. For example, for the ``RIT CS Study Abroad Program'', the instance was trained on questions related to Croatia as the destination, and the destination was changed to Germany for the new test questions. From these metrics we can see that Watson was able to answer many of the questions we trained it on. However, the metrics also show that Watson is not trained on a 1:1 basis with the questions. It is likely that the questions Watson failed to answer in the test phase need more examples during the expert training phase in order to be answered accurately later on.
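
As a quick check of the reported figure, assume that accuracy is the number of correctly answered questions divided by the total number of training questions asked (the formula is not stated explicitly, so this is an assumption):
\[
\mathrm{accuracy} = \frac{402}{402 + 82} = \frac{402}{484} \approx 0.83
\]
This is consistent with the reported 83\%.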

\section{Challenges}

The Watson platform was generally easy to work with and not error prone; however, some issues showed up with our data and with how we performed our expert training. One of the biggest problems our instance suffered was that some questions took too long to answer in the test phase: retrieving an answer could take 5 to 10 seconds. This is too long for quick question and answer applications, in which users generally expect an answer in under two seconds. The cause of this slowness was our decision to combine all of the data from the RIT public website into a single file, which totaled 36\,MB of text. The Watson service did not appear to break up this file during its own pre-processing stage, and therefore lookups in this file ended up taking longer than in other files.

An additional challenge was that Watson does not keep 1:1 question-answer pairs from the expert training. This means a question that was asked and paired with a response during the training stage may not receive an answer, or the correct answer, in the test stage. The solution to this issue is to ask the same question in many different ways to ensure that Watson is trained to answer it.

The most time-consuming part of working with Watson was not the data preparation or cleaning phase, but rather the expert training. The tools built into Watson for detecting duplicate questions and matching similar questions help alleviate this by making it easier for multiple people to train the system at once. However, it still requires many manual hours of thinking of questions and matching them with the correct answers.

\section{Learnings}

\label{learnings}

One way to train Watson more effectively is to understand the difference between low-impact and high-impact training. There are two basic methods that can be used to train Watson. One is to match the current question to a previously answered question. The other is to match an entire block of text found by Watson. A more specific way of answering the question is to select a subset of text from this block; this is known as high-impact training. Low-impact training requires rebuilding the corpus for Watson to be trained with these questions. High-impact training involves the machine learning aspects of Watson.

\section{Ethical Issues}

\label{ethics}

Whenever a new technology is created, or before it is created, a critical question must be asked: should this technology exist? The technology used for this project was IBM's Watson, a cognitive computing platform that is able to understand documents through natural language processing \cite{watson}. The main purpose of this technology is to train a machine to be a domain expert so that there would not be a need for a human domain expert.

If an instance of Watson is appropriately trained on a particular domain, then fewer domain experts would be needed to serve all those who have questions. As a result, it may increase unemployment among those domain experts. A different perspective is that this cognitive computing tool makes it possible to serve more of the entities requesting the knowledge. Unfortunately, the laws of supply and demand would still apply in this scenario and would push the wages of domain experts down, creating a disincentive to go into the field.

That is a fairly bleak outlook on what cognitive computing could do. An alternate possibility is that cognitive computing tools such as Watson could alleviate some of the pressure domain experts may feel to make themselves available to the public. Instead, they could use their knowledge to study deeper areas of the domain and generate new knowledge. With this new knowledge, the domain may experience rapid growth and massive benefits.

One additional fear of having a machine as a domain expert is that if the machine were to be used by a malicious entity, the knowledge could be compromised. As a thought exercise, consider an instance of Watson trained in military strategy. If this instance were stolen, another entity could gain a deeper understanding of military strategy. The major issue presented here is that the type of data used for the training could pose an ethical issue: the information used in this exercise would be classified as top secret and, as such, should be treated with extreme caution.

With regard to the work done on this project, our instance of Watson uses only public information from the RIT website and does not contain any private information. As a result, the previously mentioned concern about the type of data used for training can be ignored. However, the bleak outlook could be applicable to this project, as it could cause the need for college recruiters to decline, which could have a significant impact on the recruiting profession. Alternatively, the upbeat outlook could apply: recruiters could focus more on convincing students that the particular university is perfect for them, as opposed to spurring initial interest in the university.

\section{Conclusion \& Future Work}

\label{future}

We would like Watson to give answers that are based on real-time updates. If Watson is asked about the time of a particular intramural game, it should be able to answer by accessing the real-time data on the Intramurals website, which is updated quite frequently. We might also consider widening Watson's domain so that it can answer questions about each faculty member, their research area, and their achievements in their field.

\section{Appendix A}

\label{app_a}

The instance of Watson is trained to handle generalized questions about RIT that prospective and current students might ask about the campus, admissions, types of degrees, or events. The system is also trained to answer in-depth questions about the Computer Science Department using information from the public CS website.

The system can also be questioned on the various majors provided by RIT and the variety of courses available in each department. International students can ask questions about housing, tuition fees, taxes, etc. Incoming students can ask questions about admission procedures, the location of important offices such as financial aid and career services, etc. If the system fails to give the desired answer, the question can be rephrased and asked again to get the required answer. Using proper keywords while asking a question is key to getting the correct answer from Watson.

\section{Appendix B}

\label{app_b}

Individual contributions of the team members:

% write your contributions here

\paragraph{Shreyas Sureja}

\paragraph{Nathaniel Cotton}

\paragraph{Phillip Lopez}

\paragraph{Ankit Bhankharia}

\paragraph{Uday Wadhone}

\section{Figures}

\label{figures}

\begin{figure}[!htb]

\includegraphics[width=0.5\textwidth]{question_generation}

\caption{Question Generation}

\label{fig:question_generation}

\end{figure}

%

% The following two commands are all you need in the

% initial runs of your .tex file to

% produce the bibliography for the citations in your paper.

\bibliographystyle{abbrv}

\bibliography{sigproc}  % sigproc.bib is the name of the Bibliography in this case

\end{document}


* This essay may have been previously published on EssaySauce.com and/or Essay.uk.com at an earlier date than indicated.