
ABSTRACT

Document categorization is the process of assigning documents to categories based on their content. It is a critical problem today because, in most organizations, documents are still categorized manually: hundreds of files and e-mail messages are saved into folders at work every day. This is why companies need automatic classification or categorization of documents. Document classification, also known as document categorization or text classification, removes a large burden from a company or organization. A particular document may fit into two or more different categories. Without categorization of documents the search time increases; with it, both the search time and the workload of employees in an organization are reduced.

LITERATURE REVIEW

Document categorization is not a new idea; it has been studied since the 1950s. Its main purpose is to convert unstructured documents into a structured form, and various approaches have been developed by researchers to solve this problem.
Debnath Bhattacharya [1] discusses the history of document categorization and presents different approaches to the problem, including text data mining techniques. The paper describes several algorithms proposed to convert unstructured documents into structured ones, such as the lexical chain method, the linearization method, and neural network approaches, explaining how each works as well as its limitations.
Dina Goren-Bar and Tsvi Kuflik [2] discuss document categorization using LVQ (learning vector quantization) and SOM (self-organizing maps). The paper evaluates and compares document classification results obtained with these two methods. The main purpose of the work was to evaluate the possibility of automating the classification of subjectively categorized data sets; both SOM and LVQ provided accurate results.
Hang Li and Kenji Yamanishi [3] discuss a framework for document classification using a finite mixture model, a new method for classifying documents into categories. Each category is defined by a finite mixture model based on soft clustering of words, and classification is then carried out by statistical hypothesis testing. The basic problem is deciding to which of a number of predefined categories a newly arrived document should be assigned. Hard clustering addresses this directly but degrades classification results; soft clustering solves this problem by using the finite mixture model, classifying documents based on soft clusters of words.
Zhihang Chen [4] presents categorization of documents using different neural network approaches. The paper argues that a single neural network is not efficient for categorizing a large collection of documents, and therefore describes hierarchical neural networks for document categorization, which improve the efficiency of document classification.
G. S. Thakur [5] describes a new framework for document categorization/classification that achieves better efficiency and accuracy than earlier techniques, which did not provide satisfactory results. In the binary model, each row indicates a document and each column indicates a term appearing in the documents; this method converts unstructured documents into a structured representation. Text classification is based on a supervised learning model, in which the dataset is divided into two parts: a training dataset and a test dataset.
Quire Zhang and Jinghua [6] present medical document categorization using a naive Bayes approach. The paper describes how indexing is very important in document classification because it helps assign documents to predetermined classes. It also discusses the effect of different training sets on document classification, and shows that this method improves the performance of document classification.
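As a rough illustration of the idea (not the specific method of [6]), a minimal naive Bayes text classifier with add-one smoothing can be sketched as below. The toy training corpus, the category names, and the function names are hypothetical, chosen only to show the supervised training/test workflow described above.

```python
from collections import Counter
import math

# Hypothetical labelled training corpus: (category, document text) pairs.
train = [
    ("science", "experiment laboratory research chemistry experiment"),
    ("science", "physics research theory experiment"),
    ("computer", "software program code compiler software"),
    ("computer", "network code program software algorithm"),
]

def train_nb(docs):
    """Estimate per-category word counts and document priors."""
    word_counts = {}          # category -> Counter of word frequencies
    doc_counts = Counter()    # category -> number of training documents
    vocab = set()
    for label, text in docs:
        words = text.split()
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify_nb(text, word_counts, doc_counts, vocab):
    """Return the category with the highest log-posterior score."""
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(counts.values())
        for w in text.split():
            # add-one (Laplace) smoothing so unseen words do not
            # drive the posterior to zero
            score += math.log((counts[w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

word_counts, doc_counts, vocab = train_nb(train)
print(classify_nb("research experiment in the laboratory",
                  word_counts, doc_counts, vocab))  # -> science
```

Words never seen in training (such as "in" and "the" here) contribute equally to every category through the smoothing term, so only the discriminative dictionary words decide the outcome.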
Riel and Boonyasopon [7] present a data mining approach to document classification. A knowledge mining approach is used: a text analysis tool mines knowledge by analyzing documents, which are then categorized according to different domains. In this paper documents are categorized for the purpose of extracting knowledge, using a knowledge-based data mining approach.

Mahmoud Soltani and Mohammad Taher [8] present classification of textual documents using learning vector quantization, in which each class is represented by a vector called a codebook. This helps identify the distinguishing features in text documents used to categorize them. This method requires a smaller training set and is faster than other document categorization methods used for this purpose.
Dr. S. R. Suresh, T. Karthikeyan, D. B. Shanmugam, and J. Dhilipan [9] present the use of reinforcement learning for document classification. The learner measures the utility of an action by the benefit it provides in the future. The model used for document classification is the Q-learning algorithm, which runs in sub-phases: it determines a Q-value for the documents of each category belonging to a particular domain, and during the learning process maps text to the Q-value determined for each category. The authors show that reinforcement learning significantly improves the efficiency of text document classifiers and is a promising technique; because the text documents of a specific domain are gone through completely during classification, accuracy also improves.

SIGNIFICANCE

The proposed document categorization system is very useful for organizations, because every organization holds a large amount of data such as electronic documents and e-mail messages. If these are categorized, the time needed to find a particular document is reduced; if not, managing them takes a lot of effort. Today most organizations do this manually, which places a heavy burden on employees: hundreds of files and e-mail messages are saved into folders at work every day. This is why companies need automatic classification or categorization of documents, which relieves much of that burden and reduces the search time for any particular document. The importance of studying document categorization/classification grows with the increasing number of electronic documents from different types of resources, including electronic documents, news articles, and electronic mail. These resources may be structured, semi-structured, or unstructured, and extracting information from them is a very important research area. Categorization assigns each document a category based on its content. Without categorization, finding a particular document and extracting knowledge from it are very difficult tasks; document categorization software helps remove these difficulties.

OBJECTIVES

The main objective of document classification is to assign predefined categories to documents. Traditionally this is done manually by experts or employees in an organization: they read each document carefully and then assign it to one or more of the predefined categories. My objective is to reduce this burden by designing a system that fulfils this requirement. A document categorization system is applicable to a wide variety of applications such as topic spotting, e-mail routing, language guessing, and spam filtering. Categorization reduces the detail and diversity of the data, so a large amount of data is organized into groups of similar documents. Document categorization/classification combines two related processes: document categorization and document clustering. The main difference is that categorization assigns documents to predefined categories and is a supervised approach, while clustering collects similar objects together and is an unsupervised approach.
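The dictionary-based rule implemented by the MATLAB code in this report (count each category's dictionary words in the document, sum the counts per category, and pick the category with the largest total) can be sketched as follows. This is an illustration in Python, not the MATLAB code itself, and the keyword lists are hypothetical.

```python
# Hypothetical keyword dictionary: category -> supporting words,
# mirroring the column layout of "dictionary.xls" described later.
dictionary = {
    "science":  ["experiment", "research", "laboratory"],
    "computer": ["software", "program", "network"],
    "english":  ["grammar", "poem", "novel"],
}

def categorize(text, dictionary):
    """Assign the category whose keywords occur most often in the text."""
    words = text.lower().split()
    totals = {cat: sum(words.count(w) for w in kws)
              for cat, kws in dictionary.items()}
    best = max(totals, key=totals.get)
    # if no keyword matched at all, the document fits no predefined category
    return best if totals[best] > 0 else None

print(categorize("The research used a new laboratory experiment",
                 dictionary))  # -> science
```

This is the supervised half of the process: the categories and their keywords are fixed in advance, unlike clustering, which would group documents without any predefined labels.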

MATLAB SOURCE CODE

Instructions to run the code

1- Copy each of the codes below into a separate M-file.
2- Place all the files in the same folder.

3- Use the files from the link below and download them into the same folder as the source codes:

dictionary       new

4- Note that these codes are not in any particular order; copy them all before running the program.
5- Run the "final_file.m" file.

Read me file

Please follow the instructions below to run the program successfully.

# THE PROGRAM WILL RUN ONLY ON TEXT FILES, SO PLEASE CONVERT YOUR DOCUMENT TO .txt FORMAT

# REGARDING THE EXCEL FILE "dictionary.xls"

1. the Excel file "dictionary.xls" should not be deleted in any case

2. if it is deleted, make a new file with the same name and extension

3. the first row of the Excel file contains all the topics into which the document is to be sorted

4. leave row 2 empty

5. in each column, write down the supporting words for the topic in the first row of that column

# REGARDING THE TEXT FILE "new.txt"


1. the text file "new.txt" should not be deleted in any case

2. if it is deleted, make a new file with the same name and extension

3. keep the text you want classified in this file

Code 1 – Script M File – final_file.m

close all
clear all
clc

disp('%%%%%%%%%%%%% WELCOME TO THE DOCUMENT CATEGORIZATION SYSTEM %%%%%%%%%%%%%')
disp('%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%')
% function to input the file to be tested
disp('Inputting the file...')
disp('Press enter to continue...')
pause
fid=file_loc_inp;
disp('File inputted')
disp('%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%')

% checking for a match from the dictionary
disp('Checking for a match from the dictionary')
disp('This might take time according to the inputted file')
disp('or the dictionary size')
disp('Press enter to continue...')
pause
[num,txt,raw]=xlsread('dictionary');
%txt;
[row col]=size(txt);
result=[];
dict_words={};
for i=1:col    
    mat=[];
    for j=1:row
        fid=file_loc_inp;                       
        [r1 c1]=size(char(txt(j,i)));
        if c1==0
            iteration_num=0;
            mat=[mat iteration_num];            
        else
            [iteration_num]=check_file(fid,txt(j,i));        
            mat=[mat iteration_num];
        end
        dict_words(j,i)={char(txt(j,i))};        
    end
    result(:,i)=mat';
end
result;
disp('%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%')
disp('Press enter for final results')
pause
disp('%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%')
[results,wordcnt,unq_wrdcnt] = wordcount;

disp('Total number of words: ')
disp(wordcnt)
disp('Number of unique words: ')
disp(unq_wrdcnt)

disp('FINAL OUTPUT COMPRISING THE FREQUENCY AND RELATIVE FREQUENCY OF THE ')
disp('WORDS FROM DICTIONARY')
final_output=table_func(result,dict_words,results,wordcnt);
disp(final_output)

disp('OVERALL WORDCOUNT AND FREQUENCY')
disp(results)

disp('%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%')
% to check the category
if ~any(result(:)) % no dictionary word was found in the document
    disp('The inputted document does not belong to any of the categories')
else
    [index]=category(result);
    disp('File check complete...')
    disp('The category of the document is: ')
    txt(1,index)
end
disp('%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%')

Code 2 – Function M File -table_func.m

function final_output=table_func(result,dict_words,results,wordcnt)

[r c]=size(result);
mat=[];
mat_a={};
mat_b=[];
mat_c=[];
ind=1; % row number for mat_a, mat_b, mat_c
in=2;
for i=1:r
    for j=1:c
        if result(i,j)~=0
            output(in,1)={char(dict_words(i,j))};
            mat_a(ind,1)={char(dict_words(i,j))};
            
            output(in,2)={result(i,j)};
            mat_b(ind,1)=result(i,j);
            
            output(in,3)={((result(i,j))/wordcnt)*100};
            mat_c(ind,1)=((result(i,j))/wordcnt)*100;
            
            mat=[mat result(i,j)];
            in=in+1;
            ind=ind+1;
        end
    end
end

clear in
[B,IX]=sort(mat,'descend');
final_output={'DICTIONARY WORD' 'FREQUENCY' 'RELATIVE FREQUENCY (%)'};
[r c]=size(output);
s=1;
for i=2:r    
    final_output(i,1)={char(mat_a(IX(s),1))};
    final_output(i,2)={mat_b(IX(s),1)};
    final_output(i,3)={((mat_b(IX(s),1))/wordcnt)*100};
    s=s+1;
end    

end

Code 3 – Function M File – wordcount_standalone.m (interactive version; final_file.m calls the wordcount.m defined in Code 4, so save this listing under a different filename to avoid a filename clash)

function results = wordcount_standalone( filenam, num)


% First import the words from the text file into a cell array

[FileName,PathName] = uigetfile('*.txt','Select any text file');
y= [PathName,FileName];
fid = fopen(y);
words = textscan(fid,'%s');


for i=1:numel(words{1,1})
    ind = find(isstrprop(words{1,1}{i,1}, 'alphanum') == 0);

    words{1,1}{i,1}(ind)=[];
    
end

% Remove entries in words that have zero characters
for i = 1:numel(words{1,1})
    if size(words{1,1}{i,1}, 2) == 0
        words{1,1}{i,1} = ' ';
    end
end

% Now count the number of times each word appears
unique_words = unique(words{1,1});

freq = zeros(numel(unique_words), 1);

for i = 1:numel(unique_words)
    if max(unique_words{i} ~= ' ')
        for j = 1:numel(words{1,1})
            if strcmp(words{1,1}(j), unique_words{i})
                freq(i) = freq(i) + 1;
            end
        end
    end
end


% Finally, print out the results

u_freq = unique(freq);

sorted_freq = sort(u_freq, 'descend');

results={ 'WORD' 'FREQ' 'REL. FREQ' };

for i = 1:min(numel(find(sorted_freq > 1)), 10)
    ind = find(freq == sorted_freq(i));
    ind = ind(1); % if several words share this frequency, keep the first

    results{i+1, 1} = unique_words{ind};
    results{i+1, 2} = freq(ind);
    results{i+1, 3} = sprintf('%.4f%s', freq(ind)/numel(words{1,1})*100, '%');
end

sprintf('The words that appeared more than once are displayed below\nTotal number of words in "%s" = %d\nTotal number of unique words = %d', 's2.txt', numel(words{1,1}), numel(find(freq)))

sprintf('display Frequency of Word  "%d"',sorted_freq(1))
sprintf('display word of High Frequency char "%s"',results{2, 1})
sc=results{2, 1};
new_catag={'science','computer','english'};
for k = 1:3
    if strcmp(sc, new_catag{k})
        sprintf('This document comes under category: %s', sc)
    end
end
if ~any(strcmp(sc, new_catag))
    sprintf('This document does not come under any of the defined categories')
end
fclose(fid);

Code 4 – Function M File -wordcount.m

%function results = wordcount( filenam, num)
function [results,wordcnt,unq_wrdcnt] = wordcount

% First import the words from the text file into a cell array
%  [FileName,PathName] = uigetfile('*.txt','Select any text file'); %to promt user to select the file
%  y= [PathName,FileName];
% fid = fopen(y);
fid = fopen('new.txt','r');
words = textscan(fid,'%s');

for i=1:numel(words{1,1})
    ind = find(isstrprop(words{1,1}{i,1}, 'alphanum') == 0);

    words{1,1}{i,1}(ind)=[];
    
end

% Remove entries in words that have zero characters
for i = 1:numel(words{1,1})
    if size(words{1,1}{i,1}, 2) == 0
        words{1,1}{i,1} = ' ';
    end
end

% Now count the number of times each word appears
unique_words = unique(words{1,1});

freq = zeros(numel(unique_words), 1);

for i = 1:numel(unique_words)
    if max(unique_words{i} ~= ' ')
        for j = 1:numel(words{1,1})
            if strcmp(words{1,1}(j), unique_words{i})
                freq(i) = freq(i) + 1;
            end
        end
    end
end


% Finally, print out the results

u_freq = unique(freq);

sorted_freq = sort(u_freq, 'descend');

results={ 'WORD' 'FREQ' 'REL. FREQ' };

for i = 1:min(numel(find(sorted_freq > 1)), 10)
    ind = find(freq == sorted_freq(i));
    ind = ind(1); % if several words share this frequency, keep the first

    results{i+1, 1} = unique_words{ind};
    results{i+1, 2} = freq(ind);
    results{i+1, 3} = sprintf('%.4f%s', freq(ind)/numel(words{1,1})*100, '%');
end

% sprintf('The words that appeared more than once are displayed below\nTotal number of words in "%s" = %d\nTotal number of unique words = %d', 's2.txt', numel(words{1,1}), numel(find(freq)))
wordcnt= numel(words{1,1});  % -->total number of words
unq_wrdcnt=numel(find(freq));  % -->number of unique words
fclose(fid);


end

Code 5 – Function M File – category.m

function [index]=category(result)

summation=sum(result); % to get total number of repetitions for every word in dictionary (columnwise)
[maximum,index]=max(summation);

end

Code 6 – Function M File -check_file.m

function [iteration_num]=check_file(fid,term)

tline = fgetl(fid);
line_num=0;
iteration_num=0;
% ischar(tline)
% ischar(tline)~=0
if ischar(tline)~=0
    while ischar(tline)
        line_string = sprintf('%s',tline);        
        u=strfind(line_string,(char(term)));
        line_num=line_num+1;
        iteration_num = iteration_num + length(u);
        tline = fgetl(fid); %go to next line    
    end
end

fclose(fid);

end

Code 7 – Function M File – file_check.m (not called by final_file.m)

function detect=file_check(fid)

[num,txt,raw]=xlsread('dictionary');
[row col]=size(txt);
detect=[];

for i=1:col    
    ln=1;
    for j=1:row
        tline = fgetl(fid);
        detect=[];
        while ischar(tline)    
            line_string = sprintf('%s',tline); % line from the text file
            x = char(txt(j,i)); % word j of category i from the dictionary
            u = strfind(line_string, x); % checking the dictionary word in the file
            detect = [detect u];
            tline = fgetl(fid); % go to next line
            ln = ln+1;
        end
    end
end

Code 8 – Function M File -file_loc_inp.m

function fid=file_loc_inp

% fid --> variable in which the file to be tested is stored
% 'r' --> read operation
% new.txt --> name of the file. it should be present in the same
% directories with other program files. if not, then proper location should
% be provided (only text format allowed)

fid = fopen('new.txt','r');
end
