MULTIMODAL TRANSCRIPTION IN DEVELOPING ENGLISH SUBTITLES FOR UZBEK FILMS

In recent years, the increasing availability of audiovisual media across cultural and linguistic borders has intensified the demand for high-quality subtitling. Subtitling is not merely the translation of dialogue from one language into another; it is a complex multimodal process that requires careful consideration of all the meaning-making resources present in an audiovisual text. These resources — known as semiotic modalities — include not only the verbal elements but also visual composition, gestures, music, sound effects, and spatial arrangements. One of the most comprehensive approaches to capturing and analyzing these modalities is multimodal transcription, a method first systematized in audiovisual translation research by Thibault (2000) and further developed by Taylor (2003). Multimodal transcription involves breaking down a film into manageable units — frames, shots, or phases — and examining the contribution of each semiotic modality to the overall construction of meaning. This methodology has proven particularly useful in identifying how verbal elements, such as subtitles, interact with other modes of communication to create coherent, culturally resonant interpretations for target audiences.

While multimodal transcription has been extensively applied to the analysis and subtitling of European films, its use in the context of Uzbek cinema remains largely unexplored. Uzbek films often combine rich visual symbolism, culturally embedded gestures, and traditional musical cues, which together present unique challenges and opportunities for subtitling into English. Applying a multimodal approach to Uzbek audiovisual materials can not only improve subtitle accuracy but also enhance cross-cultural understanding. The topicality of this research is further underscored by recent policy developments in Uzbekistan. Presidential Decree PF-6084 of 20 October 2020, “On measures to further develop the Uzbek language and improve language policy” [1], and Presidential Decree PF-5653 of 2 February 2019, “On additional measures for the further development of the information sphere and mass communications” [2], have provided a strong legal framework for the advancement of the Uzbek language and communication media. Moreover, Presidential Resolution PQ-209 of 5 June 2024, “On measures to further develop the cinematography sector and to produce a series of films dedicated to the history of our country” [3], has created a strategic basis for the growth of the film industry and its role in preserving and promoting national culture. These legislative acts highlight the importance of making Uzbek-language audiovisual content accessible to wider audiences, including through high-quality subtitling into foreign languages.

This study adopts Taylor’s (2003) [4] adaptation of multimodal transcription to analyze a selected sequence from the Uzbek film Novda (dir. Zulfiqor Musoqov, 2015) [5]. The aim is twofold: (1) to demonstrate how multimodal transcription can be implemented as a methodological tool in the context of Uzbek cinema, and (2) to formulate practical subtitling strategies that address the interplay of linguistic and non-linguistic meaning-making resources in the chosen sequence. By doing so, the study contributes to the growing field of multimodal translation and provides a replicable framework for the subtitling of culturally specific audiovisual materials.

LITERATURE REVIEW AND METHODOLOGY. Thibault’s (2000) [6] model, originally developed for the analysis of film and television advertising, offers a systematic approach to breaking down an audiovisual text into individual frames, shots, and phases. This method allows for a detailed examination of how meaning is constructed across multiple semiotic modalities. Building on this foundation, Christopher Taylor (2003) [4] addresses the key micro-contextual question of multimodal transcription, namely how to record and analyze a multimodal product in written form for the purposes of translation and subtitling. Taylor borrows the basic segmentation principle from Thibault but adapts it into a multi-layered, multi-column transcription table that enables researchers to capture visual, auditory, and linguistic elements alongside interpretive commentary. His model, described in “Multimodal Transcription in the Analysis, Translation and Subtitling of Italian Films” [4, pp. 192–193], involves dividing the film into three hierarchical units: Frames, Shots, and Phases, each with specific analytical components.

1. Frames

Frames are the smallest unit in the transcription process. Each frame has two components:

  1. Duration of frame and order of presentation. Frames are numbered individually, and their duration is indicated in seconds. Selection of frames depends on the analytical focus — not every single frame in the film needs to be captured, only those relevant to the study.
  2. Presentation of the visual frames. Still images from the source are included to visually represent the frame being analyzed. Example: A single frame might last 2 seconds, showing a close-up of a character’s face in dim lighting, with a still image extracted to illustrate this moment.

2. Shots

Shots represent a continuous sequence of frames captured from a single camera take. Each shot is analyzed through:

2.1. Components of the visual image. These include camera position, perspective, focus, distance, salient objects, clothing, and colour schemes.

2.2. Kinesic action of the characters. This refers to gestures, facial expressions, posture, and other bodily movements that convey meaning. Example: In one shot, the camera may use a low-angle perspective to depict authority, while the character folds their arms, signaling defensiveness.

3. Phases

Phases group together shots to reflect broader narrative or communicative units. Each phase is examined in terms of:

3.1. Dialogue and description of the soundtrack. This includes spoken words, intonation, background noises, environmental sounds, and music.

3.2. Metafunctional interpretation of how the film creates meaning. This layer draws on Halliday’s systemic-functional linguistics (ideational, interpersonal, and textual meanings) and Kress and van Leeuwen’s (1996/2006) [7] visual grammar. It integrates all semiotic modalities (verbal, visual, auditory) to explain how meaning is constructed and communicated. Example: A phase might combine tense dialogue (verbal), rapid editing (visual), and suspenseful strings (auditory) to create an atmosphere of urgency, which supports the ideational and interpersonal functions of the scene.
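The three-level hierarchy described above can be pictured as a small data structure. The following Python sketch is only an illustration under stated assumptions: the class and field names (`Frame`, `Shot`, `Phase`, `frame_numbers`) are hypothetical and do not come from Taylor's paper, and the sample values are taken from frame 7 of Table 1, for which no duration was recorded.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Frame:
    """Smallest unit: order of presentation, duration, and a still image."""
    number: int                         # order of presentation (column T)
    duration_s: Optional[float] = None  # duration in seconds, if measured
    still_image: Optional[str] = None   # hypothetical path to the still


@dataclass
class Shot:
    """A run of frames from one camera take, with its visual analysis."""
    visual_components: Dict[str, str]   # abbreviation -> description
    kinesic_action: str                 # gestures, gaze, posture
    frames: List[Frame] = field(default_factory=list)


@dataclass
class Phase:
    """A group of shots with soundtrack and metafunctional reading."""
    soundtrack: str                     # dialogue and sound description
    interpretation: str                 # metafunctional interpretation
    shots: List[Shot] = field(default_factory=list)

    def frame_numbers(self) -> List[int]:
        """Return the numbers of all frames in this phase, in order."""
        return [f.number for shot in self.shots for f in shot.frames]


# Sample values copied from frame 7 of Table 1 (duration not recorded).
shot = Shot(
    visual_components={
        "CP": "Static, inside",
        "HP": "Above view",
        "VP": "Eye level",
        "D": "Very high close-up",
    },
    kinesic_action="The man's gaze is directed toward the ceiling.",
    frames=[Frame(number=7)],
)
phase = Phase(
    soundtrack="- Qachon o‘laman?",
    interpretation="An exhausted man asks a question in a faint voice.",
    shots=[shot],
)
print(phase.frame_numbers())
```

Grouping a single frame into one shot and one phase is a simplification made for the example; in a full transcription a phase would normally span several shots.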

Taylor’s multimodal transcription model is particularly valuable for subtitling, as it enables the subtitler to identify how verbal elements interact with visual and auditory cues. This allows for more informed decisions on subtitle placement, omission of redundant information, and preservation of non-verbal meaning. However, as Taylor notes, the method is labour-intensive and best suited to theoretical or descriptive studies rather than large-scale commercial subtitling projects. In the present study, Taylor’s framework provides a systematic basis for analyzing the interaction of semiotic modalities in the Uzbek film Novda, with the aim of developing English subtitles, or checking existing ones, so that they reflect both verbal and non-verbal aspects of meaning.

RESULTS AND DISCUSSION. Novda is a 2015 Uzbek comedy–melodrama directed by Zulfiqor Musoqov and produced by the Oʻzbekfilm film studio under commission from the Uzbekistan Cinematography Agency. The film features an ensemble cast of well-known Uzbek actors, including Husan Sharipov, Raʼno Zokirova, Shahzoda Matchonova, Farhod Abdullayev, Mirza Azizov, and Behzod Hamroyev [8]. The narrative centers on a harmonious rural family and the relationships among its members, intertwining themes of first love, devotion to one’s homeland, and deep respect for parents. Set against the backdrop of an Uzbek village, the film blends lighthearted comedic elements with heartfelt melodramatic moments, offering a portrayal of everyday life infused with cultural values and moral lessons. Through its use of rich visual imagery, culturally specific gestures, and a soundtrack that reflects traditional Uzbek sensibilities, Novda provides fertile ground for multimodal transcription and analysis, particularly in the context of subtitling for an English-speaking audience. Table 1 presents the multimodal transcription of a scene from the film.

The analysis of the selected scenes from Novda reveals how omissions and incomplete renderings in English subtitles can alter the film’s intended meaning, particularly when examined through a multimodal and metafunctional lens.

The first case (frame 5) — the omission of the young man’s verbal reaction (“Ha?”) to the nurse’s gesture — demonstrates how a small piece of dialogue can carry significant interpersonal metafunctional value. His brief question not only confirms that he did not fully understand the non-verbal cue but also sustains the rhythm of the interaction. By removing this verbal element, the subtitle fails to represent the misunderstanding and subtly shifts the viewer’s interpretation of the relationship between the two characters. The ideational meaning (the nurse signaling him to leave) remains partially intact through visuals, but the interpersonal nuance is diminished.

The second case (frames 9 and 11), the incomplete rendering of the doctor’s two-part utterance, highlights how subtitling can disrupt the textual metafunction when temporal and structural cohesion are not preserved. In Uzbek, the doctor’s line “O‘ttiz, qirq … yildan keyin” unfolds across two separate frames, with the family’s tearful reaction (frame 10) between them, creating a comedic pause that builds anticipation before delivering the punchline. The English subtitle’s truncated version (“In 30, 40 years”) removes the pause and final phrase, flattening the intended humor. This results in the loss of both ideational meaning (the time reference is incomplete) and interpersonal effect (the playful teasing between characters).

These findings align with previous research suggesting that subtitling cannot be evaluated solely on the accuracy of verbal translation but must also consider how verbal, visual, and kinesic modes interact to create meaning. In both examples, the omissions were only apparent when dialogue was analysed alongside non-verbal cues and shot sequencing, confirming that multimodal transcription offers a more comprehensive method for detecting subtitling issues.
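The kind of cross-check described here, reading the dialogue column against the subtitle column, can be sketched in a few lines of Python. This is a rough illustration, not the procedure used in the study: the function names `is_spoken` and `frames_missing_subtitles` are hypothetical, and treating a fully parenthesized entry such as "(crying)" as non-verbal sound is an assumption based on how Table 1 is written. The rows are copied from Table 1.

```python
from typing import List, Tuple

# (frame number, dialogue column, English subtitle column), from Table 1.
# An empty subtitle string means no subtitle was shown for that frame.
rows: List[Tuple[int, str, str]] = [
    (3, "(the same music in the background) - Oyi?", "- Mom!"),
    (5, "- Ha?", ""),
    (7, "- Qachon o‘laman?", "- So when am I dying?"),
    (9, "- O‘ttiz, qirq", "- In 30, 40 years."),
    (10, "(crying)", ""),
    (11, "- yildan keyin.", ""),
]


def is_spoken(dialogue: str) -> bool:
    """Return True if the entry contains speech.

    Assumption: an entry wrapped entirely in parentheses describes
    non-verbal sound (music, crying) rather than spoken dialogue.
    """
    text = dialogue.strip()
    if not text:
        return False
    return not (text.startswith("(") and text.endswith(")"))


def frames_missing_subtitles(table: List[Tuple[int, str, str]]) -> List[int]:
    """List frames that have spoken dialogue but no subtitle."""
    return [
        frame
        for frame, dialogue, subtitle in table
        if is_spoken(dialogue) and not subtitle.strip()
    ]


print(frames_missing_subtitles(rows))  # prints [5, 11]
```

A check like this only flags candidates. Whether a missing subtitle actually changes the meaning still has to be judged against the visual and kinesic columns, as in the two cases discussed above.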

 

 

Table 1. Multimodal transcription of a scene from Novda (dir. Zulfiqor Musoqov, 2015).

 

| T | Visual frames | Components of the visual image | Kinesic action | Dialogue | Metafunctional interpretation | English subtitle |
|---|---|---|---|---|---|---|
| 1 | | CP – Static, outside; HP – Oblique view; VP – Eye level; D – Medium-long shot; VF – The silhouetted figure looking inside; VS – Silhouette, brightly lit interior with people gathered; VC – Interior: people seated on a bed or platform, one wearing a white head covering, others in casual clothes, with folded blankets in the background; CR – Exterior in dark blues and blacks, interior in warm light with whites, beiges, muted colours; CO – Naturalistic | Young boy in Uzbek chopon approaching window, leaning slightly forward to look inside; people inside seated close together, engaged in conversation. | (The sound of the night: crickets chirping, with soft background music played on the piano) | The scene portrays the atmosphere of ordinary village life. It is a quiet night, yet something seems wrong in this large courtyard: the inner room is crowded with people. | (no subtitle, as in original) |
| 2 | | CP – Static, inside; HP – Side view; VP – Eye level; D – Medium close-up; VF – Two elderly figures gazing out the window; VS – Elderly man and woman, semi-transparent curtain, window frame; VC – Man in white shirt, woman in white head covering, ceramic bowl on the windowsill; CR – Cool blue lighting dominating the scene, whites in clothing and curtains; CO – Naturalistic | Elderly man standing closely behind an elderly woman, both leaning slightly forward toward the window; their faces turned in the same direction with attentive, intent expressions, suggesting focused observation. | (The music continues, but without crickets chirping) | Inside are the elders of the family, the most respected figures, and they too are troubled. They glance outside with worry, their expressions showing mutual understanding as they try together to find an answer to a pressing question. | (no subtitle, as in original) |
| 3 | | CP – Static, inside; HP – Front view; VP – Eye level; D – High close-up; VF – Woman’s facial expression; VC – Man in black chopon in the background; CR – Cool yellow lighting dominating the scene, black in chopon clothing; CO – Naturalistic | The woman is looking intently at a person or object, her full attention fixed. In the background, a boy in a black chopon is half-visible, crouching down. | (the same music in the background) - Oyi? | The woman is trying to grasp what has happened, deeply concerned for her loved one. | - Mom! [9] |
| 5 | | CP – Static, inside; HP – Front view; VP – Eye level; D – High close-up; VF – Woman’s facial expression; VC – two people’s shade in front; CR – Cool yellow lighting dominating the scene, white in uniform clothing; CO – Naturalistic | The nurse is looking at the young man who has entered the room, signalling with her eyes for him to leave. | - Ha? | The doctor’s reply is not meant to alarm the young man, but it is expected to be serious. The nurse signals to the young man to step outside. | (no subtitle) |
| 7 | | CP – Static, inside; HP – Above view; VP – Eye level; D – Very high close-up; VF – Man’s facial expression; VC – white pillow with yellow flowers; CR – Cool yellow lighting dominating the scene, white in the pillow; CO – Naturalistic | The man’s gaze is directed toward the ceiling. | - Qachon o‘laman? | An exhausted man, worn down by a serious illness and resigned to it, asks a question in a faint voice. | - So when am I dying? |
| 8 | | CP – Static, inside; HP – Front view; VP – Eye level; D – Medium close-up; VF – Woman’s facial expression; VC – Man in black chopon in the background; CR – Cool yellow lighting dominating the scene, black in chopon clothing; CO – Naturalistic | The woman’s expression, already surprised by the question she heard, changes even further. | - Voy?! Voy, o‘lmasam! | The woman, not expecting such a question from her husband, clasps her cheeks in surprise. In the background, their son also seems distressed, head lowered, eyes cast to the floor. | - Oh my God! |
| 9 | | CP – Static, inside; HP – Side view; VP – Eye level; D – High close-up; VF – Man’s facial expression; CR – Cool yellow lighting dominating the scene, light blue in uniform clothing | The doctor is silently mouthing a reply to the question. | - O‘ttiz, qirq | The doctor responds to the patient using numbers, yet hides a play on words in his answer. | - In 30, 40 years. |
| 10 | | CP – Static, inside; HP – Front view; VP – Eye level; D – Medium close-up; VF – Woman’s facial expression; VC – Man in black chopon in the background; CR – Cool yellow lighting dominating the scene, black in chopon clothing | Upon hearing the doctor’s answer, they begin to cry. | (crying) | Upon hearing it, the woman and her son are overcome with grief, their eyes filling with tears as they begin to cry. | |
| 11 | | CP – Static, inside; HP – Side view; VP – Eye level; D – High close-up; VF – Man’s facial expression; CR – Cool yellow lighting dominating the scene, light blue in uniform clothing | The doctor has not yet finished speaking and continues with his explanation. | - yildan keyin. | But the doctor is not finished: his wordplay reveals that the patient will not die yet; in fact, he still has much time to live. | |
| 12 | | CP – Static, inside; HP – Above view; VP – Eye level; D – Very high close-up; VF – Man’s facial expression; VC – white pillow with yellow flowers; CR – Cool yellow lighting dominating the scene, white in the pillow; CO – Naturalistic | The patient glances toward the doctor only with his eyes. | - Sovuq savolga – sovuq javob! | The patient, however, breaks into a sweat from the doctor’s cold manner of delivering this “joke.” Too weak to lift his head, he can only glance toward the doctor with his eyes. | - Stupid answer to a stupid question! |

 

Ultimately, these errors underscore the need for subtitlers to work with a full understanding of the semiotic environment of each scene. By integrating ideational, interpersonal, and textual considerations into the translation process, subtitlers can better preserve the film’s communicative intentions across languages.

CONCLUSION. This study set out to apply multimodal transcription, following Christopher J. Taylor’s methodology, to the analysis of English subtitles for selected scenes from the Uzbek film Novda (2015), directed by Zulfiqor Musoqov. By integrating verbal dialogue, visual cues, kinesic actions, and metafunctional interpretation (ideational, interpersonal, textual), the research identified subtitling errors that compromised meaning transfer.

Two key issues emerged: (1) Omitted dialogue — the young man’s verbal reaction to the nurse’s gesture was excluded from the subtitle, erasing interpersonal nuance; and (2) Incomplete rendering of wordplay — the doctor’s two-part line was truncated in the English subtitle, disrupting comedic timing and textual cohesion.

These findings demonstrate that certain subtitling issues can only be detected when verbal and non-verbal modes are examined together. The research confirms that multimodal transcription is a valuable tool for improving subtitle accuracy and ensuring that both linguistic content and semiotic context are preserved in translation. The main limitation of the study is that multimodal transcription is time-consuming and labour-intensive, which may make it commercially unviable. Nevertheless, future studies could apply this methodology to other Uzbek films, contributing to the development of high-quality audiovisual translation in Uzbekistan’s growing film industry.

 

References:

 

1. O‘zbekiston Respublikasi Prezidentining 2020-yil 20-oktabrdagi “Mamlakatimizda o‘zbek tilini yanada rivojlantirish va til siyosatini takomillashtirish chora-tadbirlari to‘g‘risida”gi PF-6084-son Farmoni [Electronic resource] // https://lex.uz/uz/docs/-5058351.

2. O‘zbekiston Respublikasi Prezidentining 2019-yil 2-fevraldagi “Axborot sohasi va ommaviy kommunikatsiyalarni yanada rivojlantirishga oid qo‘shimcha chora-tadbirlar to‘g‘risida”gi PF-5653-son Farmoni [Electronic resource] // https://lex.uz/docs/-4188795.

3. O‘zbekiston Respublikasi Prezidentining 2024-yil 5-iyundagi “Kinematografiya sohasini yanada rivojlantirish hamda mamlakatimiz tarixiga bag‘ishlangan filmlar turkumini yaratishga doir chora-tadbirlar to‘g‘risida”gi PQ-209-son qarori [Electronic resource] // https://www.lex.uz/uz/docs/-6958261.

4. Taylor C.J. Multimodal transcription in the analysis, translation and subtitling of Italian films // The Translator. 2003. Vol. 9, № 2.

5. Novda (dir. Zulfiqor Musoqov). Uzbekistan: Oʻzbekfilm, 2015.

6. Thibault P.J. The multimodal transcription of a television advertisement: Theory and practice // Multimodality and multimediality in the distance learning age. 2000.

7. Kress G., van Leeuwen T. Multimodal Discourse. London: Arnold, 2001.

8. Novda (film) [Electronic resource] // https://uz.wikipedia.org/wiki/Novda_(film).

9. Novda (with English subtitle) [Electronic resource] // https://www.youtube.com/watch?v=6E7kL6ms-M8.

Abbreviations:

1. CP – Camera Position

2. HP – Horizontal Perspective

3. VP – Vertical Perspective

4. VF – Visual Focus / Gaze Vectors

5. D – Distance

6. VS – Visually Salient Items

7. VC – Visual Collocation

8. CR – Colours

9. CO – Coding Orientation

 

Bakiev F. The role of multimodal transcription in developing English subtitles for Uzbek films. This study applies Christopher Taylor’s multimodal transcription methodology to analyze the English subtitles of Zulfiqor Musoqov’s film Novda (2015). By integrating spoken dialogue, visual imagery, movements, and metafunctional interpretation, it identifies subtitling errors that affect the transfer of meaning. Two main problems were noted: (1) the omission of a short spoken utterance, resulting in the loss of interpersonal nuance, and (2) the disruption of wordplay and textual unity because a two-part sentence was not fully reflected in the translated text. The results show that some subtitling errors can be identified only by analyzing verbal and non-verbal units together. This confirms the importance of multimodal transcription for improving subtitle accuracy and preserving the communicative aims of culturally specific audiovisual content.

 

Bakiev F. Multimodal transcription in creating English subtitles for Uzbek films. This study applies Christopher J. Taylor’s multimodal transcription methodology to analyze the English subtitles of selected scenes from the Uzbek film Novda (2015), directed by Zulfiqor Musoqov. By integrating verbal dialogue, visual imagery, kinesic actions, and metafunctional interpretation, it identifies subtitling errors that affect the transfer of meaning. Two key problems are highlighted: (1) the omission of a short verbal utterance, which led to the loss of interpersonal nuance, and (2) the incomplete rendering of a two-part phrase, which disrupted the comic pause and textual cohesion. The results confirm that some subtitling errors can be detected only through joint analysis of verbal and non-verbal modalities. This underscores the value of multimodal transcription as a tool for improving subtitle accuracy and preserving the communicative intentions of culturally specific audiovisual content.

 

 

Editorial board of the journal Xorijiy filologiya