Knowledge Dictionary for Information Extraction on the Arabic Text Data

Wahyu Syaifullah Jauharis Saputra, Master Program Department of Informatics, Faculty of Information Technology, ITS Surabaya, Keputih Sukolilo Surabaya 60111, Indonesia
Agus Zainal Arifin, Master Program Department of Informatics, Faculty of Information Technology, ITS Surabaya, Keputih Sukolilo Surabaya 60111, Indonesia
Anny Yuniarti, Master Program Department of Informatics, Faculty of Information Technology, ITS Surabaya, Keputih Sukolilo Surabaya 60111, Indonesia

Abstract

Information extraction is an early stage of textual data analysis. It obtains information from textual data that can then be used for analysis tasks such as classification and categorization. Textual data is strongly influenced by its language. Arabic is gaining significant attention in many studies because it differs greatly from other languages and, in contrast to other languages, tools and research for Arabic are still lacking. The information extracted using a knowledge dictionary is a concept. A knowledge dictionary is usually constructed manually by an expert, which takes a long time and is specific to a single problem. This paper proposes a method for building a knowledge dictionary automatically. The knowledge dictionary is formed by grouping sentences that share the same concept, on the assumption that such sentences have a high similarity value. The extracted concepts can be used as features for subsequent computational processes such as classification or categorization. The dataset used in this paper was an Arabic text dataset. The extraction results were tested using a decision tree classifier; the highest precision obtained was 71.0% and the highest recall was 75.0%.
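The grouping assumption stated in the abstract (sentences sharing a concept have a high similarity value) can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not say which similarity measure, threshold, or grouping procedure the paper uses. The sketch assumes bag-of-words cosine similarity, whitespace tokenization, a greedy single-pass grouping, and a hypothetical threshold of 0.5; the function names and the example sentences are invented for illustration.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    # Assumption: plain whitespace tokenization. A real system for Arabic
    # would likely need a proper tokenizer and stemmer.
    return Counter(sentence.split())


def cosine_similarity(a, b):
    # Cosine similarity between two token-count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_similarity(sentences, threshold=0.5):
    # Greedy single pass: a sentence joins the first group whose seed
    # sentence is similar enough; otherwise it starts a new group.
    # The threshold value is a hypothetical choice, not from the paper.
    groups = []  # list of (seed_vector, member_sentences)
    for sentence in sentences:
        vector = bag_of_words(sentence)
        for seed_vector, members in groups:
            if cosine_similarity(vector, seed_vector) >= threshold:
                members.append(sentence)
                break
        else:
            groups.append((vector, [sentence]))
    return [members for _, members in groups]


# Invented example sentences (two about the weather, one about a match).
examples = [
    "الطقس حار اليوم",
    "الطقس بارد اليوم",
    "الفريق فاز بالمباراة",
]
concept_groups = group_by_similarity(examples)
```

With these example sentences, the first two share two of three tokens and fall into one group, while the third shares none and forms its own group. Each resulting group could then stand for one concept entry, and group membership could serve as a feature for a downstream classifier.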

Bahasa Abstract (English translation)

Knowledge Dictionary for Information Extraction on Arabic Text Data. Information extraction is an early stage of textual data analysis. Information extraction is needed to obtain information from textual data so that it can be used for analysis processes such as classification and categorization. Textual data is strongly influenced by language; if the textual data is in Arabic, the characters used are Arabic characters. A knowledge dictionary is a dictionary that can be used to extract information from textual data. The information extracted using a knowledge dictionary is a concept. A knowledge dictionary is usually built manually by an expert, which takes a long time and is specific to each problem. This study proposes a method for building a knowledge dictionary automatically. The knowledge dictionary is formed by grouping sentences that share the same concept, on the assumption that sentences with the same concept will have a high similarity value. The extracted concepts can be used as features for subsequent computational processes such as classification or categorization. The dataset used in this study is an Arabic text dataset. The extraction results were tested using a decision tree classifier; the highest precision obtained was 71.0% and the highest recall was 75.0%.

Recommended Citation

Saputra, Wahyu Syaifullah Jauharis; Arifin, Agus Zainal; and Yuniarti, Anny (2012) "Knowledge Dictionary for Information Extraction on the Arabic Text Data," Makara Journal of Technology: Vol. 16: Iss. 2, Article 13.
DOI: 10.7454/mst.v16i2.1518
Available at: https://scholarhub.ui.ac.id/mjt/vol16/iss2/13

Included in

Chemical Engineering Commons, Civil Engineering Commons, Computer Engineering Commons, Electrical and Electronics Commons, Metallurgy Commons, Ocean Engineering Commons, Structural Engineering Commons
