<font face='georgia'>
<h4><strong>1. Build a TFIDF Vectorizer & compare its results with Sklearn:</strong></h4>
<ul>
<li> As a part of this task you will be implementing TFIDF vectorizer on a collection of text documents.</li>
<br>
<li> You should compare the results of your own implementation of the TFIDF vectorizer with those of sklearn's implementation of the TFIDF vectorizer.</li>
<br>
<li> Sklearn makes a few more tweaks in its version of the TFIDF vectorizer, so to replicate the exact results you would need to add the following things to your custom implementation of the tfidf vectorizer:
<ol>
<li> Sklearn has its vocabulary generated from idf sorted in alphabetical order</li>
<li> Sklearn's formula for idf is different from the standard textbook formula. Here the constant <strong>"1"</strong> is added to the numerator and denominator of the idf, as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions.<br>
$IDF(t) = 1+\log_{e}\frac{1\text{ }+\text{Total number of documents in collection}} {1+\text{Number of documents with term t in it}}.$
</li>
<li> Sklearn applies L2-normalization on its output matrix.</li>
<li> The final output of sklearn tfidf vectorizer is a sparse matrix.</li>
</ol>
<br>
<li>Steps to approach this task:
<ol>
<li> You would have to write both fit and transform methods for your custom implementation of tfidf vectorizer.</li>
<li> Print out the alphabetically sorted vocab after you fit your data and check if it's the same as the feature names from sklearn's tfidf vectorizer. </li>
<li> Print out the idf values from your implementation and check if they are the same as the idf values of sklearn's tfidf vectorizer. </li>
<li> Once your vocab and idf values are the same as those of sklearn's implementation of the tfidf vectorizer, proceed to the steps below. </li>
<li> Make sure the output of your implementation is a sparse matrix. Before generating the final output, you need to normalize your sparse matrix using L2 normalization. You can refer to this link https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html </li>
<li> After completing the above steps, print the output of your custom implementation and compare it with sklearn's implementation of the tfidf vectorizer.</li>
<li> To check the output of a single document in your collection of documents, you can convert the part of the sparse matrix related only to that document into a dense matrix and print it.</li>
</ol>
</li>
<br>
</ul>
<p> <font color="#e60000"><strong>Note-1: </strong></font> All the necessary outputs of sklearn's tfidf vectorizer have been provided as reference in this notebook; you can compare your outputs, as mentioned in the above steps, with these outputs.<br>
<font color="#e60000"><strong>Note-2: </strong></font> The output of your custom implementation and that of sklearn's implementation would match only with the collection of document strings provided to you as reference in this notebook. It would not match for strings that contain capital letters, punctuation, etc., because sklearn's version of the tfidf vectorizer deals with such strings in a different way. For further details about how sklearn's tfidf vectorizer works with such strings, you can always refer to its official documentation.<br>
<font color="#e60000"><strong>Note-3: </strong></font> During this task, it would be helpful for you to debug the code you write with print statements wherever necessary. But when you are finally submitting the assignment, make sure your code is readable and try not to print things which are not part of this task.
</p>
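The recipe described in the cell above (alphabetically sorted vocabulary, smoothed IDF, L2 normalization) can be sketched in plain Python roughly as follows. This is a minimal sketch under stated assumptions, not sklearn's implementation: the names `fit` and `transform` and the two-sentence corpus are hypothetical, tokens are simply split on whitespace, and each row is returned as a dense list rather than the sparse matrix the task asks for.

```python
import math
from collections import Counter


def fit(corpus):
    """Return (alphabetically sorted vocabulary, smoothed IDF per term)."""
    n_docs = len(corpus)
    # Document frequency: number of documents containing each term at least once.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc.split()))  # assumption: whitespace tokenization
    vocab = sorted(doc_freq)
    # IDF(t) = 1 + ln((1 + N) / (1 + df(t))), the formula quoted in the cell.
    idf = {t: 1.0 + math.log((1 + n_docs) / (1 + doc_freq[t])) for t in vocab}
    return vocab, idf


def transform(corpus, vocab, idf):
    """Return one L2-normalized TF-IDF row (dense list) per document."""
    rows = []
    for doc in corpus:
        counts = Counter(doc.split())
        # Raw term count times IDF, in vocabulary order.
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return rows


# Hypothetical corpus, lowercase and without punctuation as Note-2 requires.
corpus = ["this is the first document", "this document is the second document"]
vocab, idf = fit(corpus)
rows = transform(corpus, vocab, idf)
print(vocab)
print(rows[0])
```

To match the task's requirement of sparse output, the rows could instead be collected into a sparse matrix type such as `scipy.sparse.csr_matrix`; that step is left out here to keep the sketch dependency-free.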
The formula $IDF(t) = 1+\log_{e}\frac{1\text{ }+\text{Total number of documents in collection}} {1+\text{Number of documents with term t in it}}.$ didn't get rendered properly; however, I can see it rendered properly in the Jupyter notebook in the browser.
The LaTeX code should be interpreted properly to render the formula.
This seems to be nteract's fault. I can repro the same problem here:
https://components.nteract.io/#markdown (that's the markdown renderer we use)
In that nteract example, I changed the first cell to the following:
<Markdown data={`<font face="georgia">
<h4><strong>1. Build a TFIDF Vectorizer & compare its results with Sklearn:</strong></h4>
<ul>
<li> As a part of this task you will be implementing TFIDF vectorizer on a collection of text documents.</li>
<br>
<li> You should compare the results of your own implementation of TFIDF vectorizer with that of sklearns implemenation TFIDF vectorizer.</li>
<br>
<li> Sklearn does few more tweaks in the implementation of its version of TFIDF vectorizer, so to replicate the exact results you would need to add following things to your custom implementation of tfidf vectorizer:
<ol>
<li> Sklearn has its vocabulary generated from idf sroted in alphabetical order</li>
<li>Sklearn formula of idf is different from the standard textbook formula. Here the constant <strong>"1"</strong> is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions.<br>
$IDF(t) = 1+\log_{e}\frac{1\text{ }+\text{Total number of documents in collection}} {1+\text{Number of documents with term t in it}}.$</li>
<li> Sklearn applies L2-normalization on its output matrix.</li>
<li> The final output of sklearn tfidf vectorizer is a sparse matrix.</li>
</ol>
<br>
<li>Steps to approach this task:
<ol>
<li> You would have to write both fit and transform methods for your custom implementation of tfidf vectorizer.</li>
<li> Print out the alphabetically sorted voacb after you fit your data and check if its the same as that of the feature names from sklearn tfidf vectorizer. </li>
<li> Print out the idf values from your implementation and check if its the same as that of sklearns tfidf vectorizer idf values. </li>
<li> Once you get your voacb and idf values to be same as that of sklearns implementation of tfidf vectorizer, proceed to the below steps. </li>
<li> Make sure the output of your implementation is a sparse matrix. Before generating the final output, you need to normalize your sparse matrix using L2 normalization. You can refer to this link https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html </li>
<li> After completing the above steps, print the output of your custom implementation and compare it with sklearns implementation of tfidf vectorizer.</li>
<li> To check the output of a single document in your collection of documents, you can convert the sparse matrix related only to that document into dense matrix and print it.</li>
</ol>
</li>
<br>
</ul>
<p> <font color="#e60000"><strong>Note-1: </strong></font> All the necessary outputs of sklearns tfidf vectorizer have been provided as reference in this notebook, you can compare your outputs as mentioned in the above steps, with these outputs.<br>
<font color="#e60000"><strong>Note-2: </strong></font> The output of your custom implementation and that of sklearns implementation would match only with the collection of document strings provided to you as reference in this notebook. It would not match for strings that contain capital letters or punctuations, etc, because sklearn version of tfidf vectorizer deals with such strings in a different way. To know further details about how sklearn tfidf vectorizer works with such string, you can always refer to its official documentation.<br>
<font color="#e60000"><strong>Note-3: </strong></font> During this task, it would be helpful for you to debug the code you write with print statements wherever necessary. But when you are finally submitting the assignment, make sure your code is readable and try not to print things which are not part of this task.
</p>`} />
Thanks for the feedback! However, we don't have plans on adjusting our functionality at this time to fix this problem.