Monday, March 12, 2012

multiple text files mining

Hey everybody,

I'm absolutely new to any sort of data management.

Here it goes: suppose we store 100 .txt or .doc files in SQL Server and we want to make sure that no two files' contents match by more than 60%. The questions that arise are:

1. How do we store the files in MS SQL Server (binary format or plain text)?

2. How do we match the files?

3. What code do we write in C# for this purpose?

4. Does this have anything to do with pattern recognition?

I'd ask all users, new and experienced, to weigh in. Please help!

1. I would use an SSIS solution with the "Import Column" transformation to get the files and store them in a column of the varbinary(max) data type. If you only want to store text, you can use a varchar(max) column instead (you can store a maximum of 2 GB).

2. With the SSIS solution I mentioned, there is a way to match the files (use a "Foreach Loop" container).

3. In my opinion you don't need to do that; you can run the SSIS package periodically as a job, depending on your business logic.

4. Study the tutorial from here if you want to create a text mining project. The idea is to find out, let's say, the frequency of terms/phrases, or to extract and cluster terms/concepts from the documents.
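To make the "frequency of terms" idea in point 4 concrete, here is a minimal sketch in Python (not SSIS or C#; the same logic could be ported to a .NET script). The function name `term_frequencies` and the sample sentence are made up for illustration, and the word-splitting rule is only one of many possible choices.

```python
import re
from collections import Counter


def term_frequencies(text, top_n=5):
    """Return the top_n most frequent lowercase words in text (hypothetical helper)."""
    # Treat any run of letters, digits, or apostrophes as one word.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words).most_common(top_n)


# Made-up sample text, only for illustration.
sample = "The colonies declared independence. The colonies then formed a union."
print(term_frequencies(sample, top_n=2))
```

A real text mining project would usually also drop very common words ("the", "a") before counting.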

|||

Does SSIS have a ready-made package to compare multiple files? If yes, how does one go about it? Thanks, man!

|||

What do you mean by "compare multiple files"?

Using a Foreach Loop container you can select *.txt or *.doc files, or files with a particular name.

If you mean comparing the contents of the files, I think you can use a Script Task, where you can customize the comparison using .NET (just as you would in a plain .NET program).
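As a rough idea of what such a custom comparison could look like, here is a sketch in Python rather than .NET. It is only one possible measure: it splits each text into overlapping three-word sequences ("shingles") and computes the Jaccard similarity of the two sets. The names `shingles` and `similarity`, the shingle size, and the two sample sentences are all assumptions for illustration; nobody in this thread has confirmed this is how the 60% should be measured.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b, size=3):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


# Made-up sample essays, only for illustration.
essay_1 = "The colonies declared their independence from Britain in 1776."
essay_2 = "In 1776 the colonies declared their independence from Britain."
score = similarity(essay_1, essay_2)
print(f"similarity: {score:.0%}", "-> too similar" if score > 0.60 else "-> ok")
```

Shingles reward shared phrasing rather than shared vocabulary, so two essays on the same topic that use the same words in different sentences score lower than copied passages.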

|||

I mean that, for example, there are two text files (.txt or .doc) stored in SQL Server, each containing an essay on American independence.

I want to check that the essays do not match by more than 60%. How do I do this? Help appreciated!

|||

vickwal wrote:

I mean that, for example, there are two text files (.txt or .doc) stored in SQL Server, each containing an essay on American independence.

I want to check that the essays do not match by more than 60%. How do I do this? Help appreciated!

You can use SQL Server Integration Services (SSIS).

I think you have to convert the .doc files (presumably MS Word format) into plain-text .txt files first.

Then you can load the .txt files into a table like ESSAYS(AUTHOR varchar(255), FILENAME varchar(255), [Content] TEXT) using a Foreach Loop container.

Then use Fuzzy Lookup, comparing on the Content field and using the same ESSAYS table as both the base table and the lookup table.

You can play with the Similarity Threshold there.

The Fuzzy Lookup transformation will produce an output row for each row of the base table, including Similarity and Confidence columns. Just write that output into another table.
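Outside SSIS, the same "compare every essay against every other and keep the score" step could be sketched like this in Python. This is not the Fuzzy Lookup algorithm: it swaps in `difflib.SequenceMatcher` from the standard library as the similarity measure, and the function name `flag_similar_pairs`, the file names, and the essay texts are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def flag_similar_pairs(essays, threshold=0.60):
    """Return (name_a, name_b, ratio) for every pair scoring above threshold."""
    flagged = []
    # combinations() visits each unordered pair exactly once.
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(essays.items()), 2):
        # Compare word lists rather than characters so the ratio reflects shared wording.
        matcher = SequenceMatcher(
            None, text_a.lower().split(), text_b.lower().split(), autojunk=False
        )
        ratio = matcher.ratio()
        if ratio > threshold:
            flagged.append((name_a, name_b, ratio))
    return flagged


# Hypothetical file names and contents, standing in for rows of the ESSAYS table.
essays = {
    "alice.txt": "the colonies declared independence from britain in 1776",
    "bob.txt": "the colonies declared independence from england in 1776",
    "carol.txt": "tea was thrown into boston harbor during a protest",
}
for name_a, name_b, ratio in flag_similar_pairs(essays):
    print(f"{name_a} vs {name_b}: {ratio:.0%}")
```

With 100 files this means 4,950 pairwise comparisons, which is fine for essays but grows quickly for larger collections.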

Good luck,

Mark
